Semantic segmentation with LRASPP MobileNet v3 and OpenVINO

This Jupyter notebook can be launched online, opening an interactive environment in a browser window. You can also run it locally after installation. Choose one of the following options:

Binder | Google Colab | GitHub

The torchvision.models subpackage contains definitions of models for addressing different tasks, including: image classification, pixelwise semantic segmentation, object detection, instance segmentation, person keypoint detection, video classification, and optical flow. Throughout this notebook we will show how to use one of them. The LRASPP model is based on the Searching for MobileNetV3 paper. According to the paper, LR-ASPP (Lite Reduced Atrous Spatial Pyramid Pooling) is a lightweight and efficient segmentation decoder architecture. The model is pre-trained on the MS COCO dataset. Instead of training on all 80 classes, the segmentation model has been trained on 20 classes from the PASCAL VOC dataset: background, aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor.

More information about the model is available in the torchvision documentation.

Table of contents:

- Prerequisites
- Get a test image
- Download and prepare a model
- Define preprocessing and prepare input data
- Run an inference on the PyTorch model
- Convert the original model to OpenVINO IR Format
- Run an inference on the OpenVINO model
- Show results
- Show results for the OpenVINO IR model

Prerequisites

%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision
%pip install -q matplotlib
%pip install -q "openvino>=2023.2.0"
Note: you may need to restart the kernel to use updated packages.
from pathlib import Path

import openvino as ov
import torch

Get a test image

First of all, let's get a test image from an open dataset.

import urllib.request

from torchvision.io import read_image
import torchvision.transforms as transforms


img_path = 'cats_image.jpeg'
urllib.request.urlretrieve(
    url='https://huggingface.co/datasets/huggingface/cats-image/resolve/main/cats_image.jpeg',
    filename=img_path
)
image = read_image(img_path)
display(transforms.ToPILImage()(image))
[Output image: the downloaded test image]

Download and prepare a model

Define the width and height of the image that will be used by the network during inference. According to the input transforms function, the model is pre-trained on images with a height of 480 and a width of 640.

IMAGE_WIDTH = 640
IMAGE_HEIGHT = 480

Torchvision provides a mechanism of listing and retrieving available models.

import torchvision.models as models

# List available models
all_models = models.list_models()
# List of models by type
segmentation_models = models.list_models(module=models.segmentation)

print(segmentation_models)
['deeplabv3_mobilenet_v3_large', 'deeplabv3_resnet101', 'deeplabv3_resnet50', 'fcn_resnet101', 'fcn_resnet50', 'lraspp_mobilenet_v3_large']

We will use lraspp_mobilenet_v3_large. You can get the model by name using models.get_model("lraspp_mobilenet_v3_large", weights='DEFAULT'), or call the corresponding function directly: torchvision.models.segmentation.lraspp_mobilenet_v3_large. You can pass pre-trained model weights to the model initialization function using the weights enum LRASPP_MobileNet_V3_Large_Weights.COCO_WITH_VOC_LABELS_V1; these are the default weights. To get all available weights for the model, call weights_enum = models.get_model_weights("lraspp_mobilenet_v3_large"), but there is only one entry for this model.
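As an illustrative sketch, both retrieval styles and the weights enumeration look like this (using only the torchvision API mentioned above):

# Retrieve the model by name; 'DEFAULT' resolves to COCO_WITH_VOC_LABELS_V1
model_by_name = models.get_model("lraspp_mobilenet_v3_large", weights="DEFAULT")

# Enumerate all registered weights for this model (a single entry here)
weights_enum = models.get_model_weights("lraspp_mobilenet_v3_large")
print(list(weights_enum))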

weights = models.segmentation.LRASPP_MobileNet_V3_Large_Weights.COCO_WITH_VOC_LABELS_V1
model = models.segmentation.lraspp_mobilenet_v3_large(weights=weights)

Define preprocessing and prepare input data

You can use torchvision.transforms to define the preprocessing yourself, or use the preprocessing transforms bundled with the model weights.
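For illustration, a manual pipeline with torchvision.transforms might look like the sketch below; the ImageNet mean/std values are an assumption matching the standard normalization used with these weights:

manual_preprocess = transforms.Compose([
    transforms.ConvertImageDtype(torch.float),  # uint8 [0, 255] -> float [0.0, 1.0]
    transforms.Resize((IMAGE_HEIGHT, IMAGE_WIDTH)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # assumed ImageNet statistics
])

In this notebook we use the transforms bundled with the weights instead: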

import numpy as np


preprocess = models.segmentation.LRASPP_MobileNet_V3_Large_Weights.COCO_WITH_VOC_LABELS_V1.transforms()
preprocess.resize_size = (IMAGE_HEIGHT, IMAGE_WIDTH)  # override the resize size to match our target image size

input_data = preprocess(image)
input_data = np.expand_dims(input_data, axis=0)

Run an inference on the PyTorch model

model.eval()
with torch.no_grad():
    result_torch = model(torch.as_tensor(input_data).float())['out']
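The model returns a dictionary of tensors; the 'out' entry contains per-class logits with shape [batch, classes, height, width]. A quick check (the printed size assumes the 480x640 input defined above):

print(result_torch.shape)  # torch.Size([1, 21, 480, 640]) - 21 = background + 20 VOC classes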

Convert the original model to OpenVINO IR Format

To convert the original model to OpenVINO IR with FP16 precision, use the model conversion API. The converted model is saved in the models directory. For more information on how to convert models, see this page.

ov_model_xml_path = Path('models/ov_lraspp_model.xml')


if not ov_model_xml_path.exists():
    ov_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
    dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
    ov_model = ov.convert_model(model, example_input=dummy_input)
    ov.save_model(ov_model, ov_model_xml_path)
else:
    print(f"IR model {ov_model_xml_path} already exists.")

Run an inference on the OpenVINO model

Select a device from the dropdown list to run inference with OpenVINO:

import ipywidgets as widgets

core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Run an inference

compiled_model = core.compile_model(ov_model_xml_path, device_name=device.value)
res_ir = compiled_model(input_data)[0]
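Before visualizing, we can check numerically that the OpenVINO IR and PyTorch predictions agree; the tolerance below is an assumed value chosen to accommodate the default FP16 weight compression:

# Maximum absolute difference between OpenVINO IR and PyTorch outputs
print(np.max(np.abs(res_ir - result_torch.numpy())))
assert np.allclose(res_ir, result_torch.numpy(), atol=1e-2)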

Show results

Confirm that the segmentation results look as expected by comparing model predictions on the OpenVINO IR and PyTorch models.

You can use the PyTorch tutorial on visualizing segmentation masks. Below is a simple example of how to visualize the image with a cat mask for the PyTorch model.

import torch
import matplotlib.pyplot as plt

import torchvision.transforms.functional as F


plt.rcParams["savefig.bbox"] = 'tight'


def show(imgs):
    """Display one or more image tensors side by side without axis ticks."""
    if not isinstance(imgs, list):
        imgs = [imgs]
    fig, axs = plt.subplots(ncols=len(imgs), squeeze=False)
    for i, img in enumerate(imgs):
        img = img.detach()
        img = F.to_pil_image(img)
        axs[0, i].imshow(np.asarray(img))
        axs[0, i].set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])

Prepare and display a cat mask.

sem_classes = [
    '__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'
]
sem_class_to_idx = {cls: idx for (idx, cls) in enumerate(sem_classes)}

normalized_mask = torch.nn.functional.softmax(result_torch, dim=1)

cat_mask = normalized_mask[0, sem_class_to_idx['cat']]

show(cat_mask)
[Output image: the cat class probability mask]

The draw_segmentation_masks() function can be used to plot those masks on top of the original image. This function expects the masks to be boolean, but our masks above contain probabilities in [0, 1]. To get boolean masks, we can do the following:

class_dim = 1
boolean_cat_mask = (normalized_mask.argmax(class_dim) == sem_class_to_idx['cat'])

And now we can plot a boolean mask on top of the original image.

from torchvision.utils import draw_segmentation_masks

show(draw_segmentation_masks(image, masks=boolean_cat_mask, alpha=0.7, colors='yellow'))
[Output image: the cat mask overlaid on the original image (PyTorch model)]

Show results for the OpenVINO IR model

normalized_mask = torch.nn.functional.softmax(torch.from_numpy(res_ir), dim=1)
boolean_cat_mask = (normalized_mask.argmax(class_dim) == sem_class_to_idx['cat'])
show(draw_segmentation_masks(image, masks=boolean_cat_mask, alpha=0.7, colors='yellow'))
[Output image: the cat mask overlaid on the original image (OpenVINO IR model)]