Automatic Device Selection with OpenVINO™

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.

The Auto device (or AUTO for short) selects the most suitable device for inference by considering the model precision, power efficiency and processing capability of the available compute devices. The model precision (such as FP32, FP16 or INT8) is the first consideration, used to filter out devices that cannot run the network efficiently.
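
The precision-related capabilities that AUTO considers can be inspected per device. Below is a minimal sketch using the standard OPTIMIZATION_CAPABILITIES property; the exact values reported depend on your hardware and OpenVINO version.

from openvino.runtime import Core

core = Core()
for device in core.available_devices:
    # For example ['FP32', 'FP16', 'INT8', ...], depending on the device.
    capabilities = core.get_property(device, "OPTIMIZATION_CAPABILITIES")
    print(f"{device}: {capabilities}")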

Next, if dedicated accelerators are available, these devices are preferred (e.g. integrated and discrete GPU or VPU). CPU is used as the default “fallback device”. Please note that AUTO makes this selection only once at the model load time.
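
If you want to constrain or reorder the candidates that AUTO may choose from, a device priority list can be appended to the device name. The sketch below is illustrative only; the devices listed are an assumption and should match your machine.

from openvino.runtime import Core

core = Core()
# Assumes the IR produced later in this tutorial; any model read with read_model works here.
model = core.read_model("model/public/googlenet-v1/FP16/googlenet-v1.xml")

# AUTO considers only the listed devices, in priority order (GPU preferred, CPU as fallback).
compiled_model = core.compile_model(model=model, device_name="AUTO:GPU,CPU")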

When using accelerator devices such as GPUs, loading models to these devices may take a long time. To address this challenge for applications that require a fast first inference response, AUTO starts inferencing immediately on the CPU and then transparently shifts inferencing to the GPU once it is ready. This dramatically reduces the time to first inference.
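
One way to observe this handover is to time a series of synchronous inferences right after compiling with AUTO; on machines with a GPU, the per-inference latency typically changes once execution migrates off the CPU. The sketch below is not part of the original notebook; it reuses core and model from the sketch above and assumes input_image is prepared as in the cells further down.

import time

# Compile with AUTO and time the first few inferences; a latency shift hints at the CPU -> GPU handover.
compiled_model = core.compile_model(model=model, device_name="AUTO")
for i in range(20):
    start = time.perf_counter()
    compiled_model([input_image])
    print(f"inference {i}: {(time.perf_counter() - start) * 1000:.1f} ms")
del compiled_model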

Download and convert the model

This tutorial uses the googlenet-v1 model from Open Model Zoo. The googlenet-v1 model is the first of the Inception family of models designed to perform image classification. Like the other Inception models, googlenet-v1 was pre-trained on the ImageNet dataset. For more details about this family of models, check out the research paper.

The following code downloads googlenet-v1 and converts it to OpenVINO IR format (model/public/googlenet-v1/FP16/googlenet-v1.xml). For more information about Open Model Zoo tools, please refer to the 104-model-tools tutorial.

from pathlib import Path
from IPython.display import Markdown, display

model_name = "googlenet-v1"
base_model_dir = Path("./model").expanduser()
precision = "FP16"

download_command = (
    f"omz_downloader --name {model_name} --output_dir {base_model_dir}"
)
display(Markdown(f"Download command: `{download_command}`"))
display(Markdown(f"Downloading {model_name}..."))

# For connections that require a proxy server
# Uncomment the following two lines and add the correct proxy addresses (if they are required).
# %env https_proxy=http://proxy
# %env http_proxy=http://proxy

! $download_command

convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {base_model_dir}"
display(Markdown(f"Convert command: `{convert_command}`"))
display(Markdown(f"Converting {model_name}..."))

! $convert_command

Download command: omz_downloader --name googlenet-v1 --output_dir model

Downloading googlenet-v1…

################|| Downloading googlenet-v1 ||################

========== Downloading model/public/googlenet-v1/googlenet-v1.prototxt


========== Downloading model/public/googlenet-v1/googlenet-v1.caffemodel


========== Replacing text in model/public/googlenet-v1/googlenet-v1.prototxt

Convert command: omz_converter --name googlenet-v1 --precisions FP16 --download_dir model

Converting googlenet-v1…

========== Converting googlenet-v1 to IR (FP16)
Conversion command: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/.venv/bin/python -- /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/.venv/bin/mo --framework=caffe --data_type=FP16 --output_dir=model/public/googlenet-v1/FP16 --model_name=googlenet-v1 --input=data '--mean_values=data[104.0,117.0,123.0]' --output=prob --input_model=model/public/googlenet-v1/googlenet-v1.caffemodel --input_proto=model/public/googlenet-v1/googlenet-v1.prototxt '--layout=data(NCHW)' '--input_shape=[1, 3, 224, 224]'

Model Optimizer arguments:
Common parameters:
    - Path to the Input Model:  /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/notebooks/106-auto-device/model/public/googlenet-v1/googlenet-v1.caffemodel
    - Path for generated IR:    /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/notebooks/106-auto-device/model/public/googlenet-v1/FP16
    - IR output name:   googlenet-v1
    - Log level:    ERROR
    - Batch:    Not specified, inherited from the model
    - Input layers:     data
    - Output layers:    prob
    - Input shapes:     [1, 3, 224, 224]
    - Source layout:    Not specified
    - Target layout:    Not specified
    - Layout:   data(NCHW)
    - Mean values:  data[104.0,117.0,123.0]
    - Scale values:     Not specified
    - Scale factor:     Not specified
    - Precision of IR:  FP16
    - Enable fusing:    True
    - User transformations:     Not specified
    - Reverse input channels:   False
    - Enable IR generation for fixed input shape:   False
    - Use the transformations config file:  None
Advanced parameters:
    - Force the usage of legacy Frontend of Model Optimizer for model conversion into IR:   False
    - Force the usage of new Frontend of Model Optimizer for model conversion into IR:  False
Caffe specific parameters:
    - Path to Python Caffe* parser generated from caffe.proto:  /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/tools/mo/utils/../front/caffe/proto
    - Enable resnet optimization:   True
    - Path to the Input prototxt:   /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/notebooks/106-auto-device/model/public/googlenet-v1/googlenet-v1.prototxt
    - Path to CustomLayersMapping.xml:  /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/tools/mo/utils/../../extensions/front/caffe/CustomLayersMapping.xml
    - Path to a mean file:  Not specified
    - Offsets for a mean file:  Not specified
OpenVINO runtime found in:  /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino
OpenVINO runtime version:   2022.1.0-7019-cdb9bec7210-releases/2022/1
Model Optimizer version:    2022.1.0-7019-cdb9bec7210-releases/2022/1
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/notebooks/106-auto-device/model/public/googlenet-v1/FP16/googlenet-v1.xml
[ SUCCESS ] BIN file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-188/.workspace/scm/ov-notebook/notebooks/106-auto-device/model/public/googlenet-v1/FP16/googlenet-v1.bin
[ SUCCESS ] Total execution time: 6.53 seconds.
[ SUCCESS ] Memory consumed: 191 MB.
It's been a while, check for a new version of Intel(R) Distribution of OpenVINO(TM) toolkit here https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/download.html?cid=other&source=prod&campid=ww_2022_bu_IOTG_OpenVINO-2022-1&content=upg_all&medium=organic or on the GitHub*
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai

Import modules

import cv2
import matplotlib.pyplot as plt
import numpy as np
from openvino.runtime import Core, CompiledModel, AsyncInferQueue, InferRequest
import sys
import time

ie = Core()

if "GPU" not in ie.available_devices:
    display(Markdown('<div class="alert alert-block alert-danger"><b>Warning: </b> A GPU device is not available. This notebook requires GPU device to have meaningful results. </div>'))

Warning: A GPU device is not available. This notebook requires GPU device to have meaningful results.

(1) Simplify selection logic

Default behavior of Core::compile_model API without device_name

By default, the compile_model API selects AUTO as the device_name if no device is specified.

# set LOG_LEVEL to LOG_INFO
ie.set_property("AUTO", {"LOG_LEVEL":"LOG_INFO"})

# read the model
model = ie.read_model(model="model/public/googlenet-v1/FP16/googlenet-v1.xml")

# load model onto the target device
compiled_model = ie.compile_model(model=model)

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model without a device_name.")
Successfully compiled model without a device_name.
# Deleting the compiled model will wait until compilation on the selected device is complete.
del compiled_model
print("Deleted compiled_model")
Deleted compiled_model

Explicitly pass AUTO as device_name to Core::compile_model API

It is optional, but explicitly passing AUTO as device_name may improve readability of your code.

# set LOG_LEVEL to LOG_NONE
ie.set_property("AUTO", {"LOG_LEVEL":"LOG_NONE"})

compiled_model = ie.compile_model(model=model, device_name="AUTO")

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model using AUTO.")
Successfully compiled model using AUTO.
# Deleting the compiled model will wait for compilation on the selected device to complete.
del compiled_model
print("Deleted compiled_model")
Deleted compiled_model

(2) Improve first inference latency

One of the benefits of using AUTO device selection is reducing FIL (first inference latency). FIL is the model compilation time combined with the first inference execution time. Using the CPU device explicitly produces the shortest first inference latency, as the OpenVINO graph representation loads quickly on CPU using just-in-time (JIT) compilation. The challenge is with GPU devices, since compiling the graph to GPU-optimized OpenCL kernels takes a few seconds to complete. This initialization time may be intolerable for some applications. To avoid this delay, AUTO transparently uses the CPU as the first inference device until the GPU is ready.

Load an image

# For demonstration purposes, load the model to CPU and get inputs for buffer preparation.
compiled_model = ie.compile_model(model=model, device_name="CPU")

input_layer_ir = next(iter(compiled_model.inputs))

# Read image in BGR format
image = cv2.imread("../001-hello-world/data/coco.jpg")

# N, C, H, W = batch size, number of channels, height, width
N, C, H, W = input_layer_ir.shape

# Resize image to input size expected by the model
resized_image = cv2.resize(image, (W, H))

# Reshape to match input shape expected by the model
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

del compiled_model
../_images/106-auto-device-with-output_12_0.png
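
For comparison with the GPU and AUTO measurements below, the CPU-only first inference latency can be measured in the same way. This cell is an addition to the original notebook, reusing the model and input_image prepared above.

# Baseline: time to compile on CPU and run the first inference.
cpu_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model, device_name="CPU")
output_layer = compiled_model.output(0)
results = compiled_model([input_image])[output_layer]

cpu_fil_span = time.perf_counter() - cpu_load_start_time
print(f"Time to load model on CPU device and get first inference: {cpu_fil_span:.2f} seconds.")
del compiled_model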

Load the model to GPU device and perform inference

if "GPU" not in ie.available_devices:
    print(f"A GPU device is not available. Available devices are: {ie.available_devices}")
else:
    # Start time
    gpu_load_start_time = time.perf_counter()
    compiled_model = ie.compile_model(model=model, device_name="GPU")  # load to GPU

    # get input and output nodes
    input_layer = compiled_model.input(0)
    output_layer = compiled_model.output(0)

    # execute first inference
    results = compiled_model([input_image])[output_layer]

    # Measure time to first inference
    gpu_fil_end_time = time.perf_counter()
    gpu_fil_span = gpu_fil_end_time - gpu_load_start_time
    print(f"Time to load model on GPU device and get first inference: {gpu_fil_span:.2f} seconds.")
    del compiled_model
A GPU device is not available. Available devices are: ['CPU']

Load the model using AUTO device and perform inference

When GPU is the best available device, the first few inferences will be executed on CPU until GPU is ready.

# Start time
auto_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model)  # device_name is AUTO by default

# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)

# Execute first inference
results = compiled_model([input_image])[output_layer]


# Measure time to first inference
auto_fil_end_time = time.perf_counter()
auto_fil_span = auto_fil_end_time - auto_load_start_time
print(f"Time to load model using AUTO device and get first inference: {auto_fil_span:.2f} seconds.")
Time to load model using AUTO device and get first inference: 0.13 seconds.
# Deleting the compiled model will wait for compilation on the selected device to complete.
del compiled_model

(3) Achieve different performance for different targets

Another advantage of AUTO device selection is the performance hint. By specifying a THROUGHPUT or LATENCY hint, AUTO optimizes performance for the desired metric: the THROUGHPUT hint delivers higher frames-per-second (FPS) performance, while the LATENCY hint delivers lower latency. The performance hints do not require any device-specific settings and they are completely portable between devices – meaning AUTO can configure the performance hint on whichever device is being used.

More information about using performance hints with AUTO is available in the AUTO device documentation: AUTO#performance-hints
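
Because the hints are device-agnostic, the same configuration key can also be passed when compiling for an explicit device. A short sketch, compiling for CPU purely to illustrate the portability described above:

# The same PERFORMANCE_HINT key works regardless of the target device.
latency_on_cpu = ie.compile_model(model=model, device_name="CPU", config={"PERFORMANCE_HINT": "LATENCY"})
throughput_on_auto = ie.compile_model(model=model, device_name="AUTO", config={"PERFORMANCE_HINT": "THROUGHPUT"})
del latency_on_cpu, throughput_on_auto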

Class and callback definition

class PerformanceMetrics:
    """
    Record the latest performance metrics (fps and latency); the metrics are updated every @interval seconds.
    :member: fps: Frames per second, indicates the average number of inferences executed each second during the last @interval seconds.
    :member: latency: Average latency of inferences executed in the last @interval seconds.
    :member: start_time: Record the start timestamp of the ongoing @interval seconds duration.
    :member: latency_list: Record the latency of each inference execution over @interval seconds duration.
    :member: interval: The metrics will be updated every @interval seconds
    """
    def __init__(self, interval):
        """
        Create and initialize one instance of class PerformanceMetrics.
        :param: interval: The metrics will be updated every @interval seconds
        :returns:
            Instance of PerformanceMetrics
        """
        self.fps = 0
        self.latency = 0

        self.start_time = time.perf_counter()
        self.latency_list = []
        self.interval = interval

    def update(self, infer_request: InferRequest) -> bool:
        """
        Record the latency of this inference and update the metrics if the current @interval seconds duration has expired.
        :param: infer_request: InferRequest returned from inference callback, which includes the result of inference request.
        :returns:
            True, if metrics are updated.
            False, if @interval seconds duration is not expired and metrics are not updated.
        """
        self.latency_list.append(infer_request.latency)
        exec_time = time.perf_counter() - self.start_time
        if exec_time >= self.interval:
            # update the performance metrics
            self.start_time = time.perf_counter()
            self.fps = len(self.latency_list) / exec_time
            self.latency = sum(self.latency_list) / len(self.latency_list)
            print(f"throughput: {self.fps: .2f}fps, latency: {self.latency: .2f}ms, time interval:{exec_time: .2f}s")
            sys.stdout.flush()
            self.latency_list = []
            return True
        else:
            return False


class InferContext:
    """
    Inference context. Records and updates performance metrics via @metrics, and sets @feed_inference to False once @remaining_update_num <= 0.
    :member: metrics: instance of class PerformanceMetrics
    :member: remaining_update_num: the remaining number of performance metrics updates.
    :member: feed_inference: if feed inference request is required or not.
    """
    def __init__(self, update_interval, num):
        """
        Create and initialize one instance of class InferContext.
        :param: update_interval: The performance metrics will be updated every @update_interval seconds. This parameter will be passed to class PerformanceMetrics directly.
        :param: num: The number of times performance metrics are updated.
        :returns:
            Instance of InferContext.
        """
        self.metrics = PerformanceMetrics(update_interval)
        self.remaining_update_num = num
        self.feed_inference = True

    def update(self, infer_request: InferRequest):
        """
        Update the context. Set @feed_inference to False if the number of remaining performance metric updates (@remaining_update_num) reaches 0
        :param: infer_request: InferRequest returned from inference callback, which includes the result of inference request.
        :returns: None
        """
        if self.remaining_update_num <= 0:
            self.feed_inference = False

        if self.metrics.update(infer_request):
            self.remaining_update_num = self.remaining_update_num - 1
            if self.remaining_update_num <= 0:
                self.feed_inference = False


def completion_callback(infer_request: InferRequest, context) -> None:
    """
    Callback for the inference request; passes @infer_request to @context for updating.
    :param: infer_request: InferRequest returned for the callback, which includes the result of the inference request.
    :param: context: user data passed as the second parameter to AsyncInferQueue.start_async()
    :returns: None
    """
    context.update(infer_request)


# Performance metrics update interval (seconds) and number of times
metrics_update_interval = 10
metrics_update_num = 6

Inference when using THROUGHPUT hint

Run the inference loop, updating the FPS/latency metrics every @metrics_update_interval seconds.

THROUGHPUT_hint_context = InferContext(metrics_update_interval, metrics_update_num)

print("Compiling Model for AUTO device with THROUGHPUT hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, config={"PERFORMANCE_HINT":"THROUGHPUT"})

infer_queue = AsyncInferQueue(compiled_model, 0)  # setting to 0 will query optimal number by default
infer_queue.set_callback(completion_callback)

print(f"Start inference, {metrics_update_num: .0f} groups of FPS/latency will be measured over {metrics_update_interval: .0f}s intervals")
sys.stdout.flush()

while THROUGHPUT_hint_context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, THROUGHPUT_hint_context)

infer_queue.wait_all()

# Take the FPS and latency of the latest period
THROUGHPUT_hint_fps = THROUGHPUT_hint_context.metrics.fps
THROUGHPUT_hint_latency = THROUGHPUT_hint_context.metrics.latency

print("Done")

del compiled_model
Compiling Model for AUTO device with THROUGHPUT hint
Start inference,  6 groups of FPS/latency will be measured over  10s intervals
throughput:  456.33fps, latency:  24.67ms, time interval: 10.00s
throughput:  462.68fps, latency:  24.94ms, time interval: 10.00s
throughput:  463.97fps, latency:  24.78ms, time interval: 10.00s
throughput:  462.75fps, latency:  24.91ms, time interval: 10.00s
throughput:  462.52fps, latency:  24.90ms, time interval: 10.00s
throughput:  462.88fps, latency:  24.78ms, time interval: 10.00s
Done

Inference with LATENCY hint

Run the inference loop, updating the FPS/latency metrics every @metrics_update_interval seconds.

LATENCY_hint_context = InferContext(metrics_update_interval, metrics_update_num)

print("Compiling Model for AUTO Device with LATENCY hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, config={"PERFORMANCE_HINT":"LATENCY"})

# Setting to 0 will query optimal number by default
infer_queue = AsyncInferQueue(compiled_model, 0)
infer_queue.set_callback(completion_callback)

print(f"Start inference, {metrics_update_num: .0f} groups fps/latency will be out with {metrics_update_interval: .0f}s interval")
sys.stdout.flush()

while LATENCY_hint_context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, LATENCY_hint_context)

infer_queue.wait_all()

# Take the FPS and latency of the latest period
LATENCY_hint_fps = LATENCY_hint_context.metrics.fps
LATENCY_hint_latency = LATENCY_hint_context.metrics.latency

print("Done")

del compiled_model
Compiling Model for AUTO Device with LATENCY hint
Start inference,  6 groups of FPS/latency will be measured over  10s intervals
throughput:  269.59fps, latency:  3.23ms, time interval: 10.00s
throughput:  276.57fps, latency:  3.21ms, time interval: 10.00s
throughput:  276.86fps, latency:  3.21ms, time interval: 10.00s
throughput:  276.92fps, latency:  3.21ms, time interval: 10.00s
throughput:  276.93fps, latency:  3.21ms, time interval: 10.00s
throughput:  275.39fps, latency:  3.23ms, time interval: 10.00s
Done

Difference in FPS and latency

TPUT = 0
LAT = 1
labels = ["THROUGHPUT hint", "LATENCY hint"]

fig1, ax1 = plt.subplots(1, 1)
fig1.patch.set_visible(False)
ax1.axis('tight')
ax1.axis('off')

cell_text = []
cell_text.append(['%.2f%s' % (THROUGHPUT_hint_fps," FPS"), '%.2f%s' % (THROUGHPUT_hint_latency, " ms")])
cell_text.append(['%.2f%s' % (LATENCY_hint_fps," FPS"), '%.2f%s' % (LATENCY_hint_latency, " ms")])

table = ax1.table(cellText=cell_text, colLabels=["FPS (Higher is better)", "Latency (Lower is better)"], rowLabels=labels,
                  rowColours=["deepskyblue"] * 2, colColours=["deepskyblue"] * 2,
                  cellLoc='center', loc='upper left')
table.auto_set_font_size(False)
table.set_fontsize(18)
table.auto_set_column_width(0)
table.auto_set_column_width(1)
table.scale(1, 3)

fig1.tight_layout()
plt.show()
../_images/106-auto-device-with-output_25_0.png
# output the difference
width = 0.4
fontsize = 14

plt.rc('font', size=fontsize)
fig, ax = plt.subplots(1,2, figsize=(10, 8))

rects1 = ax[0].bar([0], THROUGHPUT_hint_fps, width, label=labels[TPUT], color='#557f2d')
rects2 = ax[0].bar([width], LATENCY_hint_fps, width, label=labels[LAT])
ax[0].set_ylabel("frames per second")
ax[0].set_xticks([width / 2])
ax[0].set_xticklabels(["FPS"])
ax[0].set_xlabel("Higher is better")

rects1 = ax[1].bar([0], THROUGHPUT_hint_latency, width, label=labels[TPUT], color='#557f2d')
rects2 = ax[1].bar([width], LATENCY_hint_latency, width, label=labels[LAT])
ax[1].set_ylabel("milliseconds")
ax[1].set_xticks([width / 2])
ax[1].set_xticklabels(["Latency (ms)"])
ax[1].set_xlabel("Lower is better")

fig.suptitle('Performance Hints')
fig.legend(labels, fontsize=fontsize)
fig.tight_layout()

plt.show()
../_images/106-auto-device-with-output_26_0.png