OpenVINO Release Notes#
2024.5 - 20 November 2024#
System Requirements | Release policy | Installation Guides
What’s new#
More GenAI coverage and framework integrations to minimize code changes.
New models supported: Llama 3.2 (1B & 3B), Gemma 2 (2B & 9B), and YOLO11.
LLM support on NPU: Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct and Phi-3 Mini-Instruct.
Noteworthy notebooks added: Sam2, Llama3.2, Llama3.2 - Vision, Wav2Lip, Whisper, and Llava.
Preview: support for Flax, a high-performance Python neural network library based on JAX. Its modular design allows for easy customization and accelerated inference on GPUs.
Broader Large Language Model (LLM) support and more model compression techniques.
Optimizations for built-in GPUs on Intel® Core™ Ultra Processors (Series 1) and Intel® Arc™ Graphics include KV Cache compression for memory reduction along with improved usability, and model load time optimizations to improve first token latency for LLMs.
Dynamic quantization was enabled to improve first token latency for LLMs on built-in Intel® GPUs without impacting accuracy on Intel® Core™ Ultra Processors (Series 1). Second token latency will also improve for large batch inference.
A new method to generate synthetic text data has been implemented in the Neural Network Compression Framework (NNCF), allowing LLMs to be compressed more accurately using data-aware methods without datasets. Coming soon: this feature will be accessible via Optimum Intel on Hugging Face.
More portability and performance to run AI at the edge, in the cloud, or locally.
Support for Intel® Xeon® 6 Processors with P-cores (formerly codenamed Granite Rapids) and Intel® Core™ Ultra 200V series processors (formerly codenamed Arrow Lake-S).
Preview: GenAI API enables multimodal AI deployment with support for multimodal pipelines for improved contextual awareness, transcription pipelines for easy audio-to-text conversions, and image generation pipelines for streamlined text-to-visual conversions.
Speculative decoding feature added to the GenAI API for improved performance and efficient text generation using a small draft model that is periodically corrected by the full-size model.
Preview: LoRA adapters are now supported in the GenAI API for developers to quickly and efficiently customize image and text generation models for specialized tasks.
The GenAI API now also supports LLMs on NPU allowing developers to specify NPU as the target device, specifically for WhisperPipeline (for whisper-base, whisper-medium, and whisper-small) and LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct and Phi-3 Mini-instruct). Use driver version 32.0.100.3104 or later for best performance.
Now deprecated#
Python 3.8 is no longer supported.
OpenVINO™ Runtime#
Common#
NumPy 2.x has been adopted for all currently supported components, including NNCF.
A new constant constructor has been added, enabling constants to be created from a data pointer as shared memory. Additionally, it can take ownership of a shared, or other, object, avoiding a two-step process to wrap memory into ov::Tensor.
Files are now read via the async ReadFile API, reducing the bottleneck for LLM model load times on GPU.
A CPU implementation of the SliceScatter operator is now available; it is used in models such as Gemma, improving LLM performance.
CPU Device Plugin#
Gold support of the Intel® Xeon® 6 platform with P-cores (formerly codenamed Granite Rapids) has been reached.
Support of Intel® Core™ Ultra 200V series processors (formerly codenamed Arrow Lake-S) has been implemented.
LLM performance has been further improved with Rotary Position Embedding optimization, Query/Key/Value fusion, and multi-layer perceptron fusion optimizations.
FP16 support has been extended with SDPA and PagedAttention, improving performance of LLM via both native APIs and the vLLM integration.
Models with LoRA adapters are now supported.
GPU Device Plugin#
The KV cache INT8 compression mechanism is now available for all supported GPUs. It enables a significant reduction in memory consumption, increasing performance with a minimal impact to accuracy (it affects systolic devices slightly more than non-systolic ones). The feature is activated by default for non-systolic devices.
LoRA adapters are now functionally supported on GPU.
A new GPU weightless blob caching feature enables caching only the model structure and reusing the weights from the original model file. Use the new OPTIMIZE_SIZE property to activate it (a usage sketch follows this list).
Dynamic quantization with INT4 and INT8 precisions has been implemented and enabled by default on Intel® Core™ Ultra platforms, improving LLM first token latency.
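For reference, a minimal sketch of enabling the weightless caching flow from Python, assuming the string property names CACHE_DIR and CACHE_MODE (with the OPTIMIZE_SIZE value) are accepted by this release; the model path is hypothetical:

```python
import openvino as ov

core = ov.Core()
core.set_property({"CACHE_DIR": "gpu_blob_cache"})   # enable model caching
compiled = core.compile_model(
    "model.xml",                                     # hypothetical IR path
    "GPU",
    {"CACHE_MODE": "OPTIMIZE_SIZE"},                 # cache model structure only, reuse original weights
)
```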
NPU Device Plugin#
Models retrieved from the OpenVINO cache now have a smaller memory footprint. The plugin releases the cached model (blob) after weights are loaded in NPU regions, reducing memory consumption during inference execution by one blob size. Model export is not available in this scenario. This optimization requires the latest NPU driver: 32.0.100.3104.
A driver bug for ov::intel_npu::device_total_mem_size has been fixed. The plugin now reports 2 GB as the maximum allocatable memory for any driver that does not support graph extension 1.8: even if older drivers report a larger amount of available memory, memory allocation would fail when 2 GB are exceeded. For drivers that support graph extension 1.8 (or newer), the plugin reports the number exposed by the driver.
A new API is used to initialize the model (available in graph extension 1.8).
Inference request set_tensors is now supported.
ov::device::LUID is now exposed on Windows.
LLM-related improvements have been implemented in terms of both memory usage and performance.
AvgPool and MaxPool operator support has been extended, adding support for more PyTorch models.
NOTE: For systems based on Intel® Core™ Ultra Processors Series 2, more than 16 GB of RAM may be required to run larger models (exceeding 4B parameters), such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B, with prompt sizes over 1024 tokens.
OpenVINO Python API#
A Constant can now be created from openvino.Tensor.
The “release_memory” method has been added for a compiled model, improving control over memory consumption.
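A minimal sketch of both Python API additions, assuming a local model.xml is available (the path is hypothetical); the Constant is built from an openvino.Tensor as described above:

```python
import numpy as np
import openvino as ov
from openvino.runtime import op

data = np.arange(16, dtype=np.float32).reshape(4, 4)
tensor = ov.Tensor(data)            # Tensor backed by the NumPy data
constant = op.Constant(tensor)      # new: Constant created directly from an openvino.Tensor

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")  # hypothetical IR path
compiled.release_memory()           # new: free internal buffers to reduce memory consumption
```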
OpenVINO Node.js API#
Querying the best device to perform inference of a model with specific operations is now available in the JavaScript API.
Contribution guidelines have been improved to make it easier for developers to contribute.
The testing scope has been extended with inference in end-to-end tests.
JavaScript API samples have been improved for readability and ease of running.
TensorFlow Framework Support#
TensorFlow 2.18.0, Keras 3.6.0, NumPy 2.0.2 in Python 3.12, and NumPy 1.26.4 in other Python versions have been added to validation.
Out-of-the-box conversion with static ranks has been improved by devising a new shape for Switch-Merge condition sub-graphs.
The complex type is now supported for the following operations: ExpandDims, Pack, Prod, Rsqrt, ScatterNd, Sub.
The following issues have been fixed:
the corner case of a single element in LinSpace, to avoid division by zero,
missing FP16 and FP64 input type support for LeakyRelu,
missing support for non-i32/i64 output index types for ArgMin/Max operations.
PyTorch Framework Support#
PyTorch version 2.5 is now supported.
OpenVINO Model Converter (OVC) now supports TorchScript and ExportedProgram saved on a drive.
The issue of aten.index.Tensor conversion for indices with “None” values has been fixed, helping to support the HF Stable Diffusion model in ExportedProgram format.
ONNX Framework Support#
ONNX version 1.17.0 is now used.
Customers’ models with DequantizeLinear-21, com.microsoft.MatMulNBits, and com.microsoft.QuickGelu operations are now supported.
JAX/Flax Framework Support#
JAX 0.4.35 and Flax 0.10.0 have been added to validation.
jax._src.core.ClosedJaxpr object conversion is now supported.
Vision Transformer from google-research/vision_transformer is now supported (with support for 37 new operations).
OpenVINO Model Server#
The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG. (read more)
The rerank endpoint has been added based on the Cohere API, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy. (read more) A request sketch for both new endpoints appears at the end of this section.
The following improvements have been made to LLM text generation:
The echo sampling parameter, together with logprobs in the completions endpoint, is now supported.
Performance has been increased on both CPU and GPU.
Throughput in high-concurrency scenarios has been increased with dynamic_split_fuse for GPU.
Testing coverage and stability have been improved.
The procedure for service deployment and model repository preparation has been simplified.
An experimental version of a Windows binary package - a native model server for Windows OS - is available. This release has a set of limitations and limited test coverage. It is intended for testing; the production-ready release is expected with 2025.0. All feedback is welcome.
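For illustration, a hedged request sketch for the new embeddings and rerank endpoints, assuming a local OVMS instance on port 8000, OpenAI-style /v3 paths, and hypothetical model names configured in the model repository:

```python
import requests

base = "http://localhost:8000/v3"   # assumed OVMS address and API prefix

# OpenAI-style text embeddings (new in this release).
emb = requests.post(f"{base}/embeddings", json={
    "model": "embedding-model",      # hypothetical model name
    "input": ["OpenVINO Model Server", "text embeddings for RAG"],
})
print(emb.json()["data"][0]["embedding"][:5])

# Cohere-style rerank of documents against a query (new in this release).
rr = requests.post(f"{base}/rerank", json={
    "model": "rerank-model",         # hypothetical model name
    "query": "What is OpenVINO?",
    "documents": ["OpenVINO is an AI inference toolkit.", "Bananas are yellow."],
})
print(rr.json())
```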
Neural Network Compression Framework#
A new nncf.data.generate_text_data() method has been added for generating a synthetic dataset for LLM compression. This approach helps to compress LLMs more accurately in situations where a dataset is not available or is insufficient. See our example for more information about the usage.
Support of data-free and data-aware weight compression methods - nncf.compress_weights() - has been extended with NF4 per-channel quantization, making compressed LLMs more accurate and faster on NPU.
Caching of computed statistics in nncf.compress_weights() is now available, significantly reducing compression time when compressing the same LLM multiple times with different compression parameters. To enable it, set the advanced statistics_path parameter of nncf.compress_weights() to the desired file path location.
The backup_mode optional parameter has been added to nncf.compress_weights(), for specifying the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (default), INT8_SYM, and NONE (retains the original floating-point precision of the model weights). In certain situations, a non-default value might give better accuracy of compressed LLMs. A combined usage sketch appears at the end of this section.
Preview support is now available for optimizing models in Torch FX format with the nncf.quantize() and nncf.compress_weights() methods. After optimization, such models can be executed directly via torch.compile(compressed_model, backend="openvino"). For more details, see the INT8 quantization example.
Memory consumption of data-aware weight compression methods, nncf.compress_weights(), has been reduced significantly, with some variation depending on the model and method.
Support for the following has changed:
NumPy 2 added
PyTorch upgraded to 2.5.1
ONNX upgraded to 1.17
Python 3.8 discontinued
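A combined, hedged sketch of the NNCF additions above (synthetic text data, statistics caching, and backup_mode). The model source, folder name, and exact argument plumbing (for example, how the synthetic texts are mapped to model inputs) are assumptions to be checked against the NNCF documentation for this release:

```python
import nncf
from optimum.intel import OVModelForCausalLM   # one possible way to obtain the LLM
from transformers import AutoTokenizer

model_dir = "llm-ov-ir"                                       # hypothetical local export
hf_model = OVModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# New: generate a synthetic text dataset when no real calibration data is available.
synthetic_texts = nncf.data.generate_text_data(hf_model, tokenizer)

compressed = nncf.compress_weights(
    hf_model.model,                                           # the underlying ov.Model
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=nncf.Dataset(synthetic_texts),                    # may require a transform to model inputs
    backup_mode=nncf.BackupMode.INT8_ASYM,                    # default; NONE keeps original FP precision
    advanced_parameters=nncf.AdvancedCompressionParameters(
        statistics_path="wc_statistics"                       # cache statistics for repeated runs
    ),
)
# Torch FX models optimized the same way can be run via
# torch.compile(compressed_model, backend="openvino").
```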
OpenVINO Tokenizers#
Several operations have been introduced and optimized.
Conversion parameters and environment info have been added to rt_info, improving reproducibility and debugging.
OpenVINO.GenAI#
The following has been added:
LoRA adapter for the LLMPipeline.
Text2ImagePipeline with LoRA adapter and text2image samples.
VLMPipeline and visual_language_chat sample for text generation models with text and image inputs.
WhisperPipeline and whisper_speech_recognition sample.
speculative_decoding_lm has been moved to an LLMPipeline-based implementation and is now installed as part of the package.
On NPU, a set of pipelines has been enabled: WhisperPipeline (for whisper-base, whisper-medium, and whisper-small), LLMPipeline (for Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct, and Phi-3 Mini-instruct). Use driver version 32.0.100.3104 or later for best performance.
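A minimal sketch of the new GenAI pipelines on NPU; the model folder names are hypothetical local exports, and the driver requirement above applies:

```python
import openvino_genai as ov_genai

# Text generation with the NPU-enabled LLMPipeline; the folder is a hypothetical
# local INT4 export of one of the models listed above.
pipe = ov_genai.LLMPipeline("llama-3-8b-instruct-int4-ov", "NPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))

# Speech recognition with the new WhisperPipeline (whisper-base/medium/small on NPU).
whisper = ov_genai.WhisperPipeline("whisper-base-ov", "NPU")
# result = whisper.generate(raw_speech)   # raw_speech: 16 kHz mono float samples
```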
Other Changes and Known Issues#
Jupyter Notebooks#
Known Issues#
Previous 2024 releases#
Deprecation And Support#
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to the OpenVINO Legacy Features and Components page.
Discontinued in 2024#
Runtime components:
Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.
OpenVINO C++/C/Python 1.0 APIs (see 2023.3 API transition guide for reference).
All ONNX Frontend legacy API (known as ONNX_IMPORTER_API).
PerformanceMode.UNDEFINED property as part of the OpenVINO Python API.
Tools:
Deployment Manager. See installation and deployment guides for current distribution options.
Post-Training Optimization Tool (POT). Neural Network Compression Framework (NNCF) should be used instead.
A Git patch for NNCF integration with huggingface/transformers. The recommended approach is to use huggingface/optimum-intel for applying NNCF optimization on top of models from Hugging Face.
Support for Apache MXNet, Caffe, and Kaldi model formats. Conversion to ONNX may be used as a solution.
The macOS x86_64 debug bins are no longer provided with the OpenVINO toolkit, starting with OpenVINO 2024.5.
Python 3.8 is no longer supported, starting with OpenVINO 2024.5.
As MXNet does not support Python versions higher than 3.8 (according to the MXNet PyPI project), it is no longer supported by OpenVINO either.
Discrete Keem Bay is no longer supported, starting with OpenVINO 2024.5.
Support for discrete devices (formerly codenamed Raptor Lake) is no longer available for NPU.
Deprecated and to be removed in the future#
Intel® Streaming SIMD Extensions (Intel® SSE) will be supported in source code form, but not enabled in the binary package by default, starting with OpenVINO 2025.0.
Ubuntu 20.04 support will be deprecated in future OpenVINO releases due to the end of standard support.
The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in Release Policy.
The OpenVINO™ Development Tools package (pip install openvino-dev) will be removed from installation options and distribution channels beginning with OpenVINO 2025.0.
Model Optimizer will be discontinued with OpenVINO 2025.0. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. It will be replaced with CPU binding configurations (ov::hint::enable_cpu_pinning); a configuration sketch appears at the end of this list.
OpenVINO Model Server components:
“auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.
A number of notebooks have been deprecated. For an up-to-date listing of available notebooks, refer to the OpenVINO™ Notebook index (openvinotoolkit.github.io).
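For reference, a hedged sketch of the CPU binding configuration that replaces the deprecated Affinity API, assuming the hint is exposed in Python as openvino.properties.hint.enable_cpu_pinning; the model path is hypothetical:

```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
# Pin inference threads to CPU cores via the hint instead of the Affinity API.
compiled = core.compile_model("model.xml", "CPU", {hints.enable_cpu_pinning: True})
```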
Legal Information#
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein.
You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2024, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.