Throughput Benchmark Python Sample

This sample demonstrates how to estimate the performance of a model using the Asynchronous Inference Request API in throughput mode. Unlike demos, this sample takes no configurable command-line arguments other than the model path. Feel free to modify the sample’s source code to try out different options.

The reported results may deviate from what benchmark_app reports. One example is model input precision for computer vision tasks: benchmark_app sets uint8, while the sample uses the default model precision, which is usually float32.
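As a rough illustration of what the precision difference means in practice, the sketch below compares the size of one input tensor at the two precisions. The shape is a hypothetical image input chosen for the example, not one taken from the sample, and input size is only one possible contributor to the difference in reported results.

```python
from math import prod

# Hypothetical image input shape (N, C, H, W); an assumption for illustration.
shape = (1, 3, 224, 224)

# Bytes per element for the two precisions mentioned above.
BYTES_PER_ELEMENT = {"uint8": 1, "float32": 4}


def input_size_bytes(shape, precision):
    """Return the number of bytes one input tensor of this shape occupies."""
    return prod(shape) * BYTES_PER_ELEMENT[precision]


u8_bytes = input_size_bytes(shape, "uint8")
f32_bytes = input_size_bytes(shape, "float32")
print(f"uint8:   {u8_bytes} bytes")
print(f"float32: {f32_bytes} bytes ({f32_bytes // u8_bytes}x larger)")
```

The same input holds four times as many bytes in float32 as in uint8, so the two tools are not necessarily measuring identical workloads.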



Validated Models

alexnet, googlenet-v1, yolo-v3-tf, face-detection-0200

Model Format

OpenVINO™ toolkit Intermediate Representation (*.xml + *.bin), ONNX (*.onnx)

Supported devices


Other language realization


The following Python API is used in the application:

OpenVINO Runtime Version

[openvino.runtime.get_version]

Get the OpenVINO API version.

Basic Infer Flow

[openvino.runtime.Core], [openvino.runtime.Core.compile_model], [openvino.runtime.InferRequest.get_tensor]

Common API to do inference: compile a model, configure input tensors.

Asynchronous Infer

[openvino.runtime.AsyncInferQueue], [openvino.runtime.AsyncInferQueue.start_async], [openvino.runtime.AsyncInferQueue.wait_all], [openvino.runtime.InferRequest.results]

Do asynchronous inference.

Model Operations

[openvino.runtime.CompiledModel.inputs]

Get inputs of a model.

Tensor Operations

[openvino.runtime.Tensor.get_shape], [openvino.runtime.Tensor.data]

Get a tensor shape and its data.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (C) 2022 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import logging as log
import sys
import statistics
from time import perf_counter

import numpy as np
import openvino as ov
from openvino.runtime import get_version
from openvino.runtime.utils.types import get_dtype

def fill_tensor_random(tensor):
    dtype = get_dtype(tensor.element_type)
    rand_min, rand_max = (0, 1) if dtype == bool else (np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)
    # np.random.uniform excludes high: add 1 to have it generated
    if np.dtype(dtype).kind in ['i', 'u', 'b']:
        rand_max += 1
    rs = np.random.RandomState(np.random.MT19937(np.random.SeedSequence(0)))
    if 0 == tensor.get_size():
        raise RuntimeError("Models with dynamic shapes aren't supported. Input tensors must have specific shapes before inference")
    # Overwrite the tensor contents in place with reproducible random values
    tensor.data[:] = rs.uniform(rand_min, rand_max, list(tensor.shape)).astype(dtype)

def main():
    log.basicConfig(format='[ %(levelname)s ] %(message)s', level=log.INFO, stream=sys.stdout)
    log.info('OpenVINO:')
    log.info(f"{'Build ':.<39} {get_version()}")
    if len(sys.argv) != 2:
        log.info(f'Usage: {sys.argv[0]} <path_to_model>')
        return 1
    # Optimize for throughput. Best throughput can be reached by
    # running multiple openvino.runtime.InferRequest instances asynchronously
    tput = {'PERFORMANCE_HINT': 'THROUGHPUT'}

    # Create Core and use it to compile a model.
    # Pick a device by replacing CPU, for example MULTI:CPU(4),GPU(8).
    # It is possible to set CUMULATIVE_THROUGHPUT as PERFORMANCE_HINT for AUTO device
    core = ov.Core()
    compiled_model = core.compile_model(sys.argv[1], 'CPU', tput)
    # AsyncInferQueue creates optimal number of InferRequest instances
    ireqs = ov.AsyncInferQueue(compiled_model)
    # Fill input data for ireqs
    for ireq in ireqs:
        for model_input in compiled_model.inputs:
            fill_tensor_random(ireq.get_tensor(model_input))
    # Warm up
    for _ in ireqs:
        ireqs.start_async()
    ireqs.wait_all()
    # Benchmark for seconds_to_run seconds and at least niter iterations
    seconds_to_run = 10
    niter = 10
    latencies = []
    in_fly = set()
    start = perf_counter()
    time_point_to_finish = start + seconds_to_run
    while perf_counter() < time_point_to_finish or len(latencies) + len(in_fly) < niter:
        idle_id = ireqs.get_idle_request_id()
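        if idle_id in in_fly:
            # The request has run before: record the latency of its last run
            latencies.append(ireqs[idle_id].latency)
        else:
            in_fly.add(idle_id)
        ireqs.start_async()
    ireqs.wait_all()
    duration = perf_counter() - start
    # Collect the latency of the final run of every request that was used
    for infer_request_id in in_fly:
        latencies.append(ireqs[infer_request_id].latency)
    # Report results
    fps = len(latencies) / duration
    log.info(f'Count:          {len(latencies)} iterations')
    log.info(f'Duration:       {duration * 1e3:.2f} ms')
    log.info('Latency:')
    log.info(f'    Median:     {statistics.median(latencies):.2f} ms')
    log.info(f'    Average:    {sum(latencies) / len(latencies):.2f} ms')
    log.info(f'    Min:        {min(latencies):.2f} ms')
    log.info(f'    Max:        {max(latencies):.2f} ms')
    log.info(f'Throughput: {fps:.2f} FPS')
    return 0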

if __name__ == '__main__':
    sys.exit(main())

How It Works

The sample compiles a model for a given device, randomly generates input data, and performs asynchronous inference repeatedly for a given number of seconds. It then processes and reports the performance results.
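For the random input step, the sample widens the upper bound by one for integer types, because a uniform float range excludes its upper limit and conversion to an integer truncates the fraction. The sketch below reproduces that effect with the standard library only. The helper name truncated_uniform is hypothetical and the code is not part of the sample.

```python
import random


def truncated_uniform(count, high, seed=0):
    """Sample floats from [0, high) and truncate them to integers."""
    rng = random.Random(seed)
    # rng.random() lies in [0.0, 1.0), so every product is strictly below high.
    return [int(rng.random() * high) for _ in range(count)]


# Without widening, values come from [0, 255), so 255 can never be produced.
without_widening = truncated_uniform(100_000, 255)
# With the upper bound widened by one, values come from [0, 256).
with_widening = truncated_uniform(100_000, 255 + 1)

print("without widening:", min(without_widening), max(without_widening))
print("with widening:   ", min(with_widening), max(with_widening))
```

Only the widened range can cover the full uint8 interval, which is what the rand_max += 1 line in the sample is for.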

You can see an explicit description of each sample step in the Integration Steps section of the “Integrate OpenVINO™ Runtime with Your Application” guide.
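The timed loop itself can be sketched without OpenVINO. The example below is a simplified stand-in, not the sample's implementation: a thread pool plays the role of the request queue, fake_inference sleeps instead of running a model, and every name and default value is an assumption made for illustration. It keeps the same stop rule as the sample: run for at least a given number of seconds and at least a given number of iterations.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter


def fake_inference():
    """Stand-in for one inference call; sleeps instead of running a model."""
    time.sleep(0.002)


def run_benchmark(seconds_to_run=0.2, niter=10, num_requests=4):
    """Keep up to num_requests jobs in flight until both stop conditions hold."""
    latencies_ms = []
    lock = threading.Lock()
    slots = threading.Semaphore(num_requests)  # one slot per simulated request

    def job():
        started = perf_counter()
        try:
            fake_inference()
        finally:
            elapsed_ms = (perf_counter() - started) * 1e3
            with lock:
                latencies_ms.append(elapsed_ms)
            slots.release()  # mark this request slot as idle again

    submitted = 0
    start = perf_counter()
    deadline = start + seconds_to_run
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        # Continue until the time budget is spent AND niter jobs were started.
        while perf_counter() < deadline or submitted < niter:
            slots.acquire()  # block until a request slot is idle
            pool.submit(job)
            submitted += 1
    # Leaving the "with" block waits for all in-flight jobs to finish.
    duration = perf_counter() - start
    return latencies_ms, duration


latencies_ms, duration = run_benchmark()
fps = len(latencies_ms) / duration
print(f"Count: {len(latencies_ms)} iterations")
print(f"Median latency: {statistics.median(latencies_ms):.2f} ms")
print(f"Throughput: {fps:.2f} FPS")
```

One difference from the sample: here each job times itself, while the sample reads the latency from a request once it becomes idle again.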


To run the sample, specify a model as its only argument:

python <path_to_sample> <path_to_model>


Before running the sample with a trained model, make sure the model is converted to the intermediate representation (IR) format (*.xml + *.bin) using the model conversion API.

The sample accepts models in ONNX format (.onnx) that do not require preprocessing.
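A quick check of the model path before launching a benchmark can catch format problems early. The helper below is a hypothetical sketch that is not part of the sample. It assumes that the IR weights file sits next to the .xml file under the same base name.

```python
from pathlib import Path


def check_model_path(model_path):
    """Return 'IR' or 'ONNX' for a usable path; raise otherwise."""
    path = Path(model_path)
    suffix = path.suffix.lower()
    if suffix not in (".xml", ".onnx"):
        raise ValueError(f"Unsupported model format '{suffix}': expected .xml or .onnx")
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    if suffix == ".onnx":
        return "ONNX"
    # Assumption: the IR weights are stored as <same name>.bin beside the .xml.
    weights = path.with_suffix(".bin")
    if not weights.is_file():
        raise FileNotFoundError(f"IR weights file not found: {weights}")
    return "IR"


# Example: a TensorFlow frozen graph is rejected before any inference starts.
try:
    check_model_path("model.pb")
except ValueError as error:
    print(error)
```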


  1. Install the openvino-dev Python package to use Open Model Zoo Tools:

    python -m pip install openvino-dev[caffe]
  2. Download a pre-trained model using:

    omz_downloader --name googlenet-v1
  3. If a model is not in the IR or ONNX format, it must be converted. You can do this using the model converter:

    omz_converter --name googlenet-v1
  4. Perform benchmarking using the googlenet-v1 model on a CPU:

    python <path_to_sample> googlenet-v1.xml

Sample Output

The application outputs performance results.

[ INFO ] OpenVINO:
[ INFO ] Build ................................. <version>
[ INFO ] Count:          2817 iterations
[ INFO ] Duration:       10012.65 ms
[ INFO ] Latency:
[ INFO ]     Median:     13.80 ms
[ INFO ]     Average:    14.10 ms
[ INFO ]     Min:        8.35 ms
[ INFO ]     Max:        28.38 ms
[ INFO ] Throughput: 281.34 FPS
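The reported figures follow directly from the list of per-request latencies and the total duration. The sketch below recomputes them. The function name summarize and the input latencies are made up for illustration, while the final cross-check uses the count and duration from the output above.

```python
import statistics


def summarize(latencies_ms, duration_s):
    """Compute the same statistics the sample logs."""
    return {
        "count": len(latencies_ms),
        "duration_ms": duration_s * 1e3,
        "median_ms": statistics.median(latencies_ms),
        "average_ms": sum(latencies_ms) / len(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        # Throughput is completed iterations per second of wall-clock time.
        "throughput_fps": len(latencies_ms) / duration_s,
    }


# Made-up latencies, only to exercise the function.
report = summarize([8.0, 12.0, 14.0, 16.0, 30.0], duration_s=0.5)
print(report)

# Cross-check with the output above: 2817 iterations in 10012.65 ms.
sample_fps = 2817 / (10012.65 / 1e3)
print(f"{sample_fps:.2f} FPS")
```

Throughput is computed from the iteration count and wall-clock duration, not from latency. In the output above it is well above 1000 / 14.10 ≈ 71 FPS, which is consistent with several requests running concurrently.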