Performance Benchmarks

This page presents benchmark results for the Intel® Distribution of OpenVINO™ toolkit and OpenVINO Model Server, for a representative selection of public neural networks and Intel® devices. The results may help you decide which hardware to use in your applications, or plan AI workloads for the hardware you have already deployed in your solutions. For a more detailed view of performance numbers for generative AI models, check the Generative AI Benchmark Results.

Key performance indicators and workload parameters.

For Vision and NLP models, throughput measures the number of inferences delivered within a latency threshold (for example, the number of Frames Per Second, FPS). For GenAI models (Large Language Models), it measures the token rate after the first token, also known as the 2nd token throughput rate, and is presented in tokens/sec. See the workload parameters below for input/output token lengths and related settings.
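As an illustration of the GenAI figure described above, the sketch below computes a token rate that excludes the first token. It assumes you have recorded one arrival timestamp per generated token; the function name and the exact formula (tokens after the first, divided by the time elapsed since the first token) are assumptions made for this example, not the published measurement code.

```python
def second_token_throughput(token_timestamps):
    """Return tokens/sec for all tokens generated after the first one.

    token_timestamps: arrival time (in seconds) of each generated token,
    in generation order. The first entry marks the end of the first-token
    phase, which is excluded from the rate.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to compute a rate")
    tokens_after_first = len(token_timestamps) - 1
    elapsed = token_timestamps[-1] - token_timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return tokens_after_first / elapsed


# Hypothetical timestamps: first token at 1.0 s, then one token every 0.25 s.
timestamps = [1.0, 1.25, 1.5, 1.75, 2.0]
rate = second_token_throughput(timestamps)
print(f"{rate:.1f} tokens/sec")
```

Excluding the first token keeps prompt processing time out of the rate, so the number reflects steady-state generation speed only.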

For Vision and NLP models, latency measures the synchronous execution of inference requests and is reported in milliseconds. Each inference request (for example: preprocess, infer, postprocess) is allowed to complete before the next one is started. This metric is relevant in usage scenarios where a single image input needs to be acted upon as soon as possible. Examples include the healthcare sector, where medical personnel request analysis of a single ultrasound image, and real-time or near real-time applications, such as an industrial robot's response to actions in its environment or obstacle avoidance for autonomous vehicles. For Transformer models like Stable Diffusion, it measures the time it takes to convert the prompt or input text into a finished image, and is presented in seconds.
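The synchronous pattern described above can be sketched as a loop in which each request finishes before the next one starts. This is a minimal illustration, not the benchmark tool itself: `fake_request` is a hypothetical stand-in for a full preprocess/infer/postprocess call, and reporting the median is an assumption made for this example.

```python
import statistics
import time


def measure_sync_latency_ms(run_request, inputs):
    """Run requests one at a time and return per-request latency in ms."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        run_request(item)  # blocks until this request has fully completed
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def fake_request(item):
    """Hypothetical stand-in for one complete inference request."""
    time.sleep(0.002)
    return item


latencies_ms = measure_sync_latency_ms(fake_request, range(5))
median_ms = statistics.median(latencies_ms)
print(f"median latency: {median_ms:.2f} ms over {len(latencies_ms)} requests")
```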

Workload parameters affect the performance results of the models used for benchmarking. Image processing models have different image size definitions, and Natural Language Processing models have different maximum token list lengths. These are listed in detail in the FAQ section. All models are executed using a batch size of 1. Below are the parameters for the GenAI models we display.

  • Input tokens: 1024,

  • Output tokens: 128,

  • number of beams: 1

For text to image:

  • iteration steps: 20,

  • image size (HxW): 256 x 256,

  • input token length: 1024 (the tokens for GenAI models are in English).
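If you want to reproduce a comparable run, it can help to record these parameters in one place. The sketch below does that with two dataclasses; the class and field names are hypothetical, while the values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMWorkload:
    """Parameters listed above for GenAI (LLM) benchmarks."""
    input_tokens: int = 1024
    output_tokens: int = 128
    num_beams: int = 1
    batch_size: int = 1


@dataclass(frozen=True)
class TextToImageWorkload:
    """Parameters listed above for text-to-image benchmarks."""
    iteration_steps: int = 20
    image_height: int = 256
    image_width: int = 256
    input_token_length: int = 1024
    batch_size: int = 1


llm = LLMWorkload()
text_to_image = TextToImageWorkload()

# Number of tokens generated after the first one, i.e. the tokens a
# "2nd token throughput" figure would be averaged over in this setup.
tokens_after_first = llm.output_tokens - 1
print(llm, text_to_image, tokens_after_first)
```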

Platforms, Configurations, Methodology

For a listing of all platforms and configurations used for testing, refer to the following:

The OpenVINO benchmark setup includes a single system with OpenVINO™ and the benchmark application installed. It measures the time spent on actual inference (excluding any pre- or post-processing) and then reports the number of inferences per second (or Frames Per Second).
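The idea of timing only the inference step can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: `preprocess`, `infer`, and `postprocess` are hypothetical stand-ins, and the real benchmark application may schedule requests differently.

```python
import time


def benchmark_fps(preprocess, infer, postprocess, frames):
    """Return inferences per second, counting only time spent in infer()."""
    infer_seconds = 0.0
    for frame in frames:
        tensor = preprocess(frame)  # not timed
        start = time.perf_counter()
        output = infer(tensor)  # the only timed stage
        infer_seconds += time.perf_counter() - start
        postprocess(output)  # not timed
    return len(frames) / infer_seconds


results = []


def fake_infer(tensor):
    """Hypothetical model call that takes about one millisecond."""
    time.sleep(0.001)
    return sum(tensor)


frames = list(range(20))
fps = benchmark_fps(lambda frame: [frame], fake_infer, results.append, frames)
print(f"{fps:.1f} FPS (inference time only)")
```

Because the pre- and post-processing stages sit outside the timed region, slow input decoding or result handling does not lower the reported FPS.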

OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. Its benchmark results are measured in a multiple-clients-single-server configuration, using two hardware platforms connected by Ethernet. Network bandwidth depends on both the platforms and the models used. It is set so that it is not a bottleneck for workload intensity. The connection is dedicated only to measuring performance.


The benchmark setup for OVMS consists of four main parts:

OVMS Benchmark Setup Diagram
  • OpenVINO™ Model Server is launched as a docker container on the server platform, where it listens to (and answers) requests from clients. It runs on the same system that is used for the corresponding OpenVINO™ toolkit benchmark application measurements. Models served by OpenVINO™ Model Server are located in a local file system mounted into the docker container. The OpenVINO™ Model Server instance communicates with other components via ports over a dedicated docker network.

  • Clients run on a separate physical machine, referred to as the client platform. They are implemented in Python 3, based on the TensorFlow* API, and work as parallel processes. Each client waits for a response from OpenVINO™ Model Server before it sends the next request. Clients also verify the responses.

  • The load balancer runs on the client platform in a docker container. HAProxy is used for this purpose. Its main role is to count the requests forwarded from clients to OpenVINO™ Model Server, estimate their latency, and share this information through the Prometheus service. The load balancer is located on the client side to simulate a real-life scenario that includes the impact of the physical network on the reported metrics.

  • Execution Controller is launched on the client platform. It is responsible for synchronization of the whole measurement process, downloading metrics from the load balancer, and presenting the final report of the execution.
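The interaction between the four parts above can be sketched as a small closed-loop simulation. Everything here is a hypothetical stand-in: `fake_server` replaces the model server, `RequestCounter` plays the counting role of the load balancer, threads are used instead of the separate client processes described above, and `run_benchmark` takes the place of the execution controller. No HAProxy, Prometheus, or TensorFlow API is involved.

```python
import threading
import time


class RequestCounter:
    """Stand-in for the load balancer: counts requests and sums latency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0
        self.total_latency_s = 0.0

    def forward(self, server, payload):
        start = time.perf_counter()
        response = server(payload)
        elapsed = time.perf_counter() - start
        with self._lock:
            self.count += 1
            self.total_latency_s += elapsed
        return response


def fake_server(payload):
    """Hypothetical model server that doubles its input after a short delay."""
    time.sleep(0.001)
    return payload * 2


def client(balancer, n_requests, errors):
    for i in range(n_requests):
        # Closed loop: wait for the response before sending the next request.
        response = balancer.forward(fake_server, i)
        if response != i * 2:  # clients also verify responses
            errors.append(i)


def run_benchmark(n_clients=4, n_requests=10):
    """Stand-in for the execution controller: start clients, build a report."""
    balancer = RequestCounter()
    errors = []
    threads = [
        threading.Thread(target=client, args=(balancer, n_requests, errors))
        for _ in range(n_clients)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    wall_seconds = time.perf_counter() - start
    return {
        "requests": balancer.count,
        "errors": len(errors),
        "mean_latency_ms": 1000.0 * balancer.total_latency_s / balancer.count,
        "requests_per_second": balancer.count / wall_seconds,
    }


report = run_benchmark()
print(report)
```

Because each client waits for its response, the number of requests in flight never exceeds the number of clients, so adding clients is how the load on the server is increased in this kind of setup.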

Test performance yourself

You can also test performance for your system yourself, following the guide on getting performance numbers.

Disclaimers

  • Intel® Distribution of OpenVINO™ toolkit performance results are based on release 2024.3, as of July 31, 2024.

  • OpenVINO Model Server performance results are based on release 2024.3, as of Aug. 19, 2024.

The results may not reflect all publicly available updates. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

See configuration disclosure for details. No product can be absolutely secure. Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex. Your costs and results may vary. Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.

Results may vary. For more information, see F.A.Q. See Legal Information.