OpenVINO Release Notes

2024.1 - 24 April 2024

System Requirements | Release policy | Installation Guides

What’s new

  • More Gen AI coverage and framework integrations to minimize code changes.

    • Mixtral and URLNet models optimized for performance improvements on Intel® Xeon® processors.

    • Stable Diffusion 1.5, ChatGLM3-6B, and Qwen-7B models optimized for improved inference speed on Intel® Core™ Ultra processors with integrated GPU.

    • Support for Falcon-7B-Instruct, a ready-to-use GenAI chat/instruct Large Language Model (LLM) with superior performance metrics.

    • New Jupyter Notebooks added: YOLO V9, YOLO V8 Oriented Bounding Boxes Detection (OBB), Stable Diffusion in Keras, MobileCLIP, RMBG-v1.4 Background Removal, Magika, TripoSR, AnimateAnyone, LLaVA-Next, and RAG system with OpenVINO and LangChain.

  • Broader LLM model support and more model compression techniques.

    • LLM compilation time reduced through additional optimizations with compressed embeddings. Improved 1st token performance of LLMs on the 4th and 5th generations of Intel® Xeon® processors with Intel® Advanced Matrix Extensions (Intel® AMX).

    • Better LLM compression and improved performance with oneDNN, INT4, and INT8 support for Intel® Arc™ GPUs.

    • Significant memory reduction for select smaller GenAI models on Intel® Core™ Ultra processors with integrated GPU.

  • More portability and performance to run AI at the edge, in the cloud, or locally.

    • The preview NPU plugin for Intel® Core™ Ultra processors is now available in the OpenVINO open-source GitHub repository, in addition to the main OpenVINO package on PyPI.

    • The JavaScript API is now more easily accessible through the npm repository, enabling JavaScript developers’ seamless access to the OpenVINO API.

    • FP16 inference is now enabled by default for Convolutional Neural Networks (CNNs) on ARM processors.

OpenVINO™ Runtime

Common

  • Unicode file paths for cached models are now supported on Windows (see the sketch after this list).

  • Pad pre-processing API to extend input tensor on edges with constants.

  • A fix for inference failures of certain image generation models has been implemented (fused I/O port names after transformation).

  • The compiler’s warnings-as-errors option is now enabled, raising the coding criteria and quality. Build warnings are no longer allowed for new OpenVINO code, and the existing warnings have been fixed.
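As an illustration of the Unicode cache path support above, a minimal sketch (the cache directory and model file name are hypothetical):

    import openvino as ov

    core = ov.Core()
    # The cache directory may now contain non-ASCII characters on Windows.
    core.set_property({"CACHE_DIR": "D:/модели/кэш"})
    # Compiling the same model again loads the cached blob from CACHE_DIR.
    compiled = core.compile_model("model.xml", "CPU")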

AUTO Inference Mode

  • Returning the ov::enable_profiling value from ov::CompiledModel is now supported.
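For example, a minimal sketch (the model path is hypothetical; "PERF_COUNT" is assumed to be the string key backing ov::enable_profiling):

    import openvino as ov

    core = ov.Core()
    compiled = core.compile_model("model.xml", "AUTO", {"PERF_COUNT": True})
    # The ov::enable_profiling value can now be read back from the compiled model.
    print(compiled.get_property("PERF_COUNT"))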

CPU Device Plugin

  • 1st token performance of LLMs has been improved on the 4th and 5th generations of Intel® Xeon® processors with Intel® Advanced Matrix Extensions (Intel® AMX).

  • LLM compilation time and memory footprint have been improved through additional optimizations with compressed embeddings.

  • Performance of MoE (e.g. Mixtral), Gemma, and GPT-J has been improved further.

  • Performance has been improved significantly for a wide set of models on ARM devices.

  • FP16 inference precision is now the default for all types of models on ARM devices (FP32 can still be requested via the inference precision hint, as sketched after this list).

  • CPU architecture-agnostic build has been implemented, to enable unified binary distribution on different ARM devices.
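A minimal sketch of overriding the new FP16 default on ARM (model path hypothetical; the hint is assumed to be exposed under the string key "INFERENCE_PRECISION_HINT"):

    import openvino as ov

    core = ov.Core()
    # FP16 is the default inference precision on ARM devices as of 2024.1;
    # request full FP32 execution explicitly if needed.
    compiled = core.compile_model("model.xml", "CPU", {"INFERENCE_PRECISION_HINT": "f32"})
    print(compiled.get_property("INFERENCE_PRECISION_HINT"))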

GPU Device Plugin

  • LLM first token latency has been improved on both integrated and discrete GPU platforms.

  • For the ChatGLM3-6B model, average token latency has been improved on integrated GPU platforms.

  • For Stable Diffusion 1.5 FP16 precision, performance has been improved on Intel® Core™ Ultra processors.

NPU Device Plugin

  • The NPU plugin is now part of the OpenVINO GitHub repository, so the most recent plugin changes are immediately available there. Note that the NPU is a component of Intel® Core™ Ultra processors.

  • New OpenVINO™ notebook “Hello, NPU!” introducing NPU usage with OpenVINO has been added.

  • Microsoft Windows® 11 64-bit, version 22H2 or later, is required to run inference on the NPU.
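A minimal sketch of targeting the NPU device (model path hypothetical; requires an Intel® Core™ Ultra processor with a working NPU driver):

    import openvino as ov

    core = ov.Core()
    if "NPU" in core.available_devices:
        compiled = core.compile_model("model.xml", "NPU")
    else:
        # Fall back to CPU on systems without an NPU.
        compiled = core.compile_model("model.xml", "CPU")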

OpenVINO Python API

  • RemoteTensors are now created without holding the GIL. Holding the GIL prevents effective multithreading, so releasing it during creation improves performance, which is critical for the Remote Tensor concept.

  • Packed data type BF16 on the Python API level has been added, opening a new way of supporting data types not handled by numpy.

  • ‘pad’ operator support for ov::preprocess::PrePostProcessorItem has been added.

  • ov.PartialShape.dynamic(int) definition has been provided.
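A short sketch of the two Python API additions above (shapes and values are illustrative):

    import openvino as ov

    # Packed BF16 tensor created through the OpenVINO type system,
    # without relying on a numpy dtype (numpy has no native bfloat16).
    bf16_tensor = ov.Tensor(ov.Type.bf16, [2, 3])
    print(bf16_tensor.element_type)

    # ov.PartialShape.dynamic(int): a shape of the given rank with all dimensions dynamic.
    print(ov.PartialShape.dynamic(3))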

OpenVINO C API

  • Two new pre-processing APIs for scale and mean have been added.

OpenVINO Node.js API

  • New methods have been added to align the JavaScript API with the C++ API, such as CompiledModel.exportModel(), core.import_model(), Core set/get property, Tensor.get_size(), and Model.is_dynamic().

  • Documentation has been extended to help developers start integrating JavaScript applications with OpenVINO™.

TensorFlow Framework Support

  • tf.keras.layers.TextVectorization tokenizer is now supported.

  • Conversion of models with Variable and HashTable (dictionary) resources has been improved.

  • 8 NEW operations have been added (see the list here, marked as NEW).

  • 10 operations have received complex tensor support.

  • Input tensor names for TF1 models have been adjusted to have a single name per input.

  • Hugging Face model support coverage has increased significantly, due to:

    • fixed extraction of the input signature of a model in memory,

    • fixed reading of variable values for a model in memory.

PyTorch Framework Support

  • ModuleExtension, a new type of extension for PyTorch models is now supported (PR #23536).

  • 22 NEW operations have been added.

  • Experimental support for models produced by torch.export (FX graph) has been added (PR #23815).
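A minimal sketch of the experimental torch.export path, assuming ov.convert_model accepts the resulting ExportedProgram directly:

    import torch
    import openvino as ov

    model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU()).eval()
    example_args = (torch.randn(1, 8),)

    # Capture an FX graph with torch.export and hand it to convert_model (experimental).
    exported = torch.export.export(model, example_args)
    ov_model = ov.convert_model(exported)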

ONNX Framework Support

  • 8 new operations have been added.

OpenVINO Model Server

  • The OpenVINO™ Runtime backend used is now 2024.1.

  • OpenVINO™ models with the String data type on input and output are now supported, so developers can take advantage of tokenization built into the model as the first layer and rely on any postprocessing embedded into the model that returns text only. Check the demo on string input data with the universal-sentence-encoder model and the String output model demo.

  • MediaPipe Python calculators have been updated to support relative paths for all related configuration and Python code files. Now, the complete graph configuration folder can be deployed in an arbitrary path without any code changes.

  • KServe REST API support has been extended to properly handle the string format in the JSON body, just like the binary format compatible with NVIDIA Triton™ (see the sketch after this list).

  • A demo showcasing a full RAG algorithm fully delegated to the model server has been added.
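A minimal sketch of a string request in the KServe v2 REST format (the endpoint, model name, and input name are hypothetical):

    import requests

    payload = {
        "inputs": [
            {
                "name": "text",
                "shape": [1],
                "datatype": "BYTES",   # string data is passed as BYTES elements in the JSON body
                "data": ["OpenVINO Model Server now accepts string inputs"],
            }
        ]
    }
    response = requests.post("http://localhost:8000/v2/models/my_model/infer", json=payload)
    print(response.json())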

Neural Network Compression Framework

  • Model subgraphs can now be defined in the ignored scope for INT8 Post-training Quantization, nncf.quantize(), simplifying the exclusion of accuracy-sensitive layers from quantization (see the sketch after this list).

  • A batch size of more than 1 is now partially supported for INT8 Post-training Quantization, speeding up the process. Note that it is not recommended for transformer-based models as it may impact accuracy. Here is an example demo.

  • Now it is possible to apply fine-tuning on INT8 models after Post-training Quantization to improve model accuracy and make it easier to move from post-training to training-aware quantization. Here is an example demo.
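A minimal sketch of excluding a subgraph via the ignored scope, assuming an nncf.Subgraph defined by boundary tensor names (the names, model, and dataset are hypothetical placeholders):

    import nncf

    ignored = nncf.IgnoredScope(
        subgraphs=[
            nncf.Subgraph(
                inputs=["attention/query", "attention/key"],
                outputs=["attention/output"],
            )
        ]
    )
    # `model` and `calibration_dataset` are assumed to be prepared beforehand.
    quantized_model = nncf.quantize(model, calibration_dataset, ignored_scope=ignored)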

OpenVINO Tokenizers

  • TensorFlow support has been extended - TextVectorization layer translation:

    • Aligned existing ops with TF ops and added a translator for them.

    • Added new ragged tensor ops and string ops.

  • A new tokenizer type, RWKV is now supported:

    • Added Trie tokenizer and Fuse op for ragged tensors.

    • A new way to get OV Tokenizers: build a vocab from file.

  • Tokenizer caching has been redesigned to work with the OpenVINO™ model caching mechanism.

Other Changes and Known Issues

Jupyter Notebooks

The default branch for the OpenVINO™ Notebooks repository has been changed from ‘main’ to ‘latest’. The ‘main’ branch of the notebooks repository is now deprecated and will be maintained until September 30, 2024.

The new branch, ‘latest’, offers a better user experience and simplifies maintenance due to significant refactoring and an improved directory naming structure.

Use the local README.md file and OpenVINO™ Notebooks at GitHub Pages to navigate through the content.

The following notebooks have been updated or newly added:

Known Issues

Component - CPU Plugin
ID - N/A
Description:
The default CPU pinning policy on Windows has been changed to follow the Windows scheduling policy instead of controlling CPU pinning in the OpenVINO plugin. This may introduce some run-to-run performance variance on Windows. Developers can use ov::hint::enable_cpu_pinning to enable or disable CPU pinning explicitly, as sketched below.
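A minimal sketch of setting the hint explicitly (model path hypothetical; "ENABLE_CPU_PINNING" is assumed to be the string key backing ov::hint::enable_cpu_pinning):

    import openvino as ov

    core = ov.Core()
    # Pin inference threads explicitly instead of following the Windows scheduling policy.
    compiled = core.compile_model("model.xml", "CPU", {"ENABLE_CPU_PINNING": True})
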
Component - Hardware Configuration
ID - N/A
Description:
Reduced performance for LLMs may be observed on newer CPUs. To mitigate, modify the default settings in BIOS to change the system into a 2 NUMA node system:
1. Enter the BIOS configuration menu.
2. Select EDKII Menu -> Socket Configuration -> Uncore Configuration -> Uncore General Configuration -> SNC.
3. The SNC setting is set to AUTO by default. Change it to disabled to configure one NUMA node per processor socket upon boot.
4. After the system reboots, confirm the NUMA node setting using: numactl -H. Expect to see only nodes 0 and 1 on a 2-socket system, with the following node distance mapping:
   node   0   1
     0:  10  21
     1:  21  10

Previous 2024 releases

2024.0 - 06 March 2024

What’s new

  • More Generative AI coverage and framework integrations to minimize code changes.

    • Improved out-of-the-box experience for TensorFlow sentence encoding models through the installation of OpenVINO™ toolkit Tokenizers.

    • New and noteworthy models validated: Mistral, StableLM-tuned-alpha-3b, and StableLM-Epoch-3B.

    • OpenVINO™ toolkit now supports Mixture of Experts (MoE), a new architecture that enables more efficient generative models in the pipeline.

    • JavaScript developers now have seamless access to the OpenVINO API. This new binding enables smooth integration of OpenVINO into JavaScript applications.

  • Broader Large Language Model (LLM) support and more model compression techniques.

    • Improved quality on INT4 weight compression for LLMs by adding the popular technique, Activation-aware Weight Quantization, to the Neural Network Compression Framework (NNCF). This addition reduces memory requirements and helps speed up token generation.

    • Experience enhanced LLM performance on Intel® CPUs, with internal memory state enhancement and INT8 precision for the KV-cache, specifically tailored for multi-query LLMs like ChatGLM.

    • The OpenVINO™ 2024.0 release makes development easier by integrating more OpenVINO™ features with the Hugging Face ecosystem. Store quantization configurations for popular models directly in Hugging Face to compress models into INT4 format while preserving accuracy and performance.

  • More portability and performance to run AI at the edge, in the cloud, or locally.

    • A preview plugin architecture of the integrated Neural Processor Unit (NPU) as part of Intel® Core™ Ultra processor (codename Meteor Lake) is now included in the main OpenVINO™ package on PyPI.

    • Improved performance on ARM by enabling the ARM threading library. In addition, we now support multi-core ARM processors and have enabled FP16 precision by default on macOS.

    • New and improved LLM serving samples from OpenVINO Model Server for multi-batch inputs and Retrieval Augmented Generation (RAG).

OpenVINO™ Runtime

Common

  • The legacy API for CPP and Python bindings has been removed.

  • StringTensor support has been extended by operators such as Gather, Reshape, and Concat, as a foundation to improve support for tokenizer operators and compliance with the TensorFlow Hub.

  • oneDNN has been updated to v3.3 (see oneDNN release notes).

CPU Device Plugin

  • LLM performance on Intel® CPU platforms has been improved for systems based on AVX2 and AVX512, using dynamic quantization and internal memory state optimization, such as INT8 precision for KV-cache. The 13th and 14th generations of Intel® Core™ processors and Intel® Core™ Ultra processors use AVX2 for CPU execution and will benefit from this speedup. Enable these features by setting "DYNAMIC_QUANTIZATION_GROUP_SIZE":"32" and "KV_CACHE_PRECISION":"u8" in the configuration file (see the sketch after this list).

  • The ov::affinity API configuration is now deprecated and will be removed in release 2025.0.

  • The following have been improved and optimized:

    • Multi-query structure LLMs (such as ChatGLM 2/3) for BF16 on the 4th and 5th generation Intel® Xeon® Scalable processors.

    • Mixtral model performance.

    • 8-bit compressed LLM compilation time and memory usage, valuable for models with large embeddings like Qwen.

    • Convolutional networks in FP16 precision on ARM processors.
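A minimal sketch of enabling the dynamic quantization and KV-cache settings named above, passed as compile-time configuration in Python (model path hypothetical):

    import openvino as ov

    core = ov.Core()
    compiled = core.compile_model(
        "llm.xml",
        "CPU",
        {
            "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",  # dynamic quantization with group size 32
            "KV_CACHE_PRECISION": "u8",               # store the KV-cache in INT8
        },
    )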

GPU Device Plugin

  • The following have been improved and optimized:

    • Average token latency for LLMs on integrated GPU (iGPU) platforms, using INT4-compressed models with large context size on Intel® Core™ Ultra processors.

    • LLM beam search performance on iGPU. Both average and first-token latency decrease may be expected for larger context sizes.

    • Multi-batch performance of YOLOv5 on iGPU platforms.

  • Memory usage for LLMs has been optimized, enabling ‘7B’ models with larger context on 16 GB platforms.

NPU Device Plugin (preview feature)

  • The NPU plugin for OpenVINO™ is now available through PyPI (run “pip install openvino”).

OpenVINO Python API

  • .add_extension method signatures have been aligned, improving API behavior for better user experience.

OpenVINO C API

  • ov_property_key_cache_mode (C++ ov::cache_mode) now enables the optimize_size and optimize_speed modes to set/get model cache.

  • The VA surface exception on Windows has been fixed.

OpenVINO Node.js API

  • OpenVINO JS bindings are consistent with the OpenVINO C++ API.

  • A new distribution channel is now available: Node Package Manager (npm) software registry (check the installation guide).

  • JavaScript API is now available for Windows users, as some limitations for platforms other than Linux have been removed.

TensorFlow Framework Support

  • String tensors are now natively supported, handled on input, output, and intermediate layers (PR #22024).

    • TensorFlow Hub universal-sentence-encoder-multilingual inferred out of the box

    • string tensors supported for Gather, Concat, and Reshape operations

    • integration with the openvino-tokenizers module - importing openvino-tokenizers automatically patches the TensorFlow FE with the required translators for models with tokenization (see the sketch after this list)

  • Operation-level fallback from Model Optimizer to the legacy Frontend is no longer available. Fallback by .json config will remain until Model Optimizer is discontinued (PR #21523).

  • Support for the following has been added:

  • The following issues have been fixed:

    • UpSampling2D conversion crashed when the input type was int16 (PR #20838).

    • IndexError on list index for Squeeze (PR #22326).

    • Correct FloorDiv computation for signed integers (PR #22684).

    • Fixed bad cast error for tf.TensorShape to ov.PartialShape (PR #22813).

    • Fixed reading tf.string attributes for models in memory (PR #22752).
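A minimal sketch of the openvino-tokenizers integration mentioned above (the SavedModel path is hypothetical; the openvino-tokenizers package must be installed):

    import openvino as ov
    # Importing openvino_tokenizers patches the TensorFlow frontend with the
    # translators required for tokenization ops (see the item above).
    import openvino_tokenizers  # noqa: F401

    # Hypothetical local copy of the TF Hub universal-sentence-encoder-multilingual SavedModel.
    ov_model = ov.convert_model("universal-sentence-encoder-multilingual")
    ov.save_model(ov_model, "use_multilingual.xml")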

ONNX Framework Support

  • ONNX Frontend now uses the OpenVINO API 2.0.

PyTorch Framework Support

  • Names for outputs unpacked from dict or tuple are now clearer (PR #22821).

  • FX Graph (torch.compile) now supports kwarg inputs, improving data type coverage. (PR #22397).
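A minimal sketch of the torch.compile path with the OpenVINO backend (the kwargs usage is illustrative):

    import torch
    import openvino.torch  # noqa: F401 - registers the "openvino" backend for torch.compile

    def f(x, scale=1.0):
        return torch.relu(x) * scale

    compiled_f = torch.compile(f, backend="openvino")
    print(compiled_f(torch.randn(2, 3), scale=0.5))  # kwarg inputs are now supported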

OpenVINO Model Server

  • OpenVINO™ Runtime backend used is now 2024.0.

  • Text generation demo now supports multi batch size, with streaming and unary clients.

  • The REST client now supports servables based on MediaPipe graphs, including Python pipeline nodes.

  • Included dependencies have received security-related updates.

  • Reshaping a model in runtime based on the incoming requests (auto shape and auto batch size) is deprecated and will be removed in the future. Using OpenVINO’s dynamic shape models is recommended instead.

Neural Network Compression Framework (NNCF)

  • The Activation-aware Weight Quantization (AWQ) algorithm for data-aware 4-bit weight compression is now available. It facilitates better accuracy for compressed LLMs with a high ratio of 4-bit weights. To enable it, use the dedicated awq optional parameter of the nncf.compress_weights() API (see the sketch after this list).

  • ONNX models are now supported in Post-training Quantization with Accuracy Control, through the nncf.quantize_with_accuracy_control() method. It may be used for models in the OpenVINO IR and ONNX formats.

  • A weight compression example tutorial is now available, demonstrating how to find the appropriate hyperparameters for the TinyLLama model from the Hugging Face Transformers, as well as other LLMs, with some modifications.
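A minimal sketch of the AWQ option (the model, dataset, and INT4 settings are illustrative placeholders):

    import nncf

    # `model` is an ov.Model and `calibration_dataset` an nncf.Dataset prepared beforehand.
    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        ratio=0.8,
        group_size=64,
        dataset=calibration_dataset,
        awq=True,  # enable Activation-aware Weight Quantization
    )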

OpenVINO Tokenizer

  • Regex support has been improved.

  • Model coverage has been improved.

  • Tokenizer metadata has been added to rt_info.

  • Limited support for TensorFlow Text models has been added: MUSE from TF Hub can be converted with string inputs.

  • OpenVINO Tokenizers have their own repository now: /openvino_tokenizers
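For reference, a minimal sketch of converting a Hugging Face tokenizer with the package (the tokenizer name is illustrative; convert_tokenizer is assumed to return an ov.Model):

    from transformers import AutoTokenizer
    from openvino_tokenizers import convert_tokenizer
    import openvino as ov

    hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    ov_tokenizer = convert_tokenizer(hf_tokenizer)
    ov.save_model(ov_tokenizer, "openvino_tokenizer.xml")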

Other Changes and Known Issues

Jupyter Notebooks

The following notebooks have been updated or newly added:

Known issues

Component - CPU Plugin
ID - N/A
Description:
Starting with 2024.0, model inputs and outputs will no longer have tensor names unless explicitly set, to align with the PyTorch framework behavior.
Component - GPU runtime
ID - 132376
Description:
First-inference latency slowdown for LLMs on Intel® Core™ Ultra processors. A drop of up to 10-20% may occur due to radical memory optimization for processing long sequences (about 1.5-2 GB of reduced memory usage).
Component - CPU runtime
ID - N/A
Description:
Performance results (first token latency) may vary from those offered by the previous OpenVINO version, for “latency” hint inference of LLMs with long prompts on Xeon platforms with 2 or more sockets. The reason is that only the CPU cores of the single socket running the application are employed, lowering the memory overhead for LLMs when NUMA control is not used.
Workaround:
This behavior is expected, but stream and thread configuration may be used to include cores from all sockets.

Deprecation And Support

Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to the OpenVINO Legacy Features and Components page.

Discontinued in 2024

  • Runtime components:

    • Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.

    • OpenVINO C++/C/Python 1.0 APIs (see 2023.3 API transition guide for reference).

    • All ONNX Frontend legacy API (known as ONNX_IMPORTER_API).

    • PerformanceMode.UNDEFINED property as part of the OpenVINO Python API.

  • Tools:

Deprecated and to be removed in the future