OpenVINO Release Notes#
2024.4 - 19 September 2024#
System Requirements | Release policy | Installation Guides
What’s new#
More Gen AI coverage and framework integrations to minimize code changes.
Support for GLM-4-9B Chat, MiniCPM-1B, Llama 3 and 3.1, Phi-3-Mini, Phi-3-Medium and YOLOX-s models.
Noteworthy notebooks added: Florence-2, NuExtract-tiny Structure Extraction, Flux.1 Image Generation, PixArt-α: Photorealistic Text-to-Image Synthesis, and Phi-3-Vision Visual Language Assistant.
Broader Large Language Model (LLM) support and more model compression techniques.
OpenVINO™ runtime optimized for Intel® Xe Matrix Extensions (Intel® XMX) systolic arrays on built-in GPUs for efficient matrix multiplication resulting in significant LLM performance boost with improved 1st and 2nd token latency, as well as a smaller memory footprint on Intel® Core™ Ultra Processors (Series 2).
Memory sharing enabled for NPUs on Intel® Core™ Ultra Processors (Series 2) for efficient pipeline integration without memory copy overhead.
Addition of the PagedAttention feature for discrete GPUs* enables a significant boost in throughput for parallel inferencing when serving LLMs on Intel® Arc™ Graphics or Intel® Data Center GPU Flex Series.
More portability and performance to run AI at the edge, in the cloud, or locally.
Support for Intel® Core Ultra Processors Series 2 (formerly codenamed Lunar Lake) on Windows.
OpenVINO™ Model Server now comes with production-quality support for OpenAI-compatible API which enables significantly higher throughput for parallel inferencing on Intel® Xeon® processors when serving LLMs to many concurrent users.
Improved performance and memory consumption with prefix caching, KV cache compression, and other optimizations for serving LLMs using OpenVINO™ Model Server.
Support for Python 3.12.
Support for Red Hat Enterprise Linux (RHEL) version 9.3 - 9.4.
Now deprecated#
The following will not be available beyond the 2024.4 OpenVINO version:
The macOS x86_64 debug bins
Python 3.8
dKMB support
Intel® Streaming SIMD Extensions (Intel® SSE) will be supported in source code form, but not enabled in the binary package by default, starting with OpenVINO 2025.0.
Common#
Encryption and decryption of topology in model cache is now supported with callback functions provided by the user (CPU only for now; ov::cache_encryption_callbacks).
The Ubuntu20 and Ubuntu22 Docker images now include the tokenizers and GenAI CPP modules, including pre-installed Python modules, in development versions of these images.
Python 3.12 is now supported.
CPU Device Plugin#
The following is now supported:
Tensor parallel feature for multi-socket CPU inference, with performance improvement for LLMs with 6B+ parameters (enabled through model_distribution_policy hint configurations).
RMSNorm operator, optimized with JIT kernel to improve both the 1st and 2nd token performance of LLMs.
The following has been improved:
vLLM support, with PagedAttention exposing attention score as the second output. It can now be used in the cache eviction algorithm to improve LLM serving performance.
1st token performance with Llama series of models, with additional CPU operator optimization (such as MLP, SDPA) on BF16 precision.
Default oneTBB version on Linux is now 2021.13.0, improving overall performance on latest Intel XEON platforms.
MXFP4 weight compression models (compressing weights to 4-bit with the e2m1 data type without a zero point and with 8-bit e8m0 scales) have been optimized for Xeon platforms thanks to fullyconnected compressed weight LLM support.
The following has been fixed:
Memory leak when ov::num_streams value is 0.
CPU affinity mask is changed after OpenVINO execution when OpenVINO is compiled with -DTHREADING=SEQ.
GPU Device Plugin#
Dynamic quantization for LLMs is now supported on discrete GPU platforms.
Stable Diffusion 3 is now supported with good accuracy on Intel GPU platforms.
Both first and second token latency for LLMs have been improved on Intel GPU platforms.
The issue of model cache not regenerating with the value changes of
ov::hint::performance_mode
orov::hint::dynamic_quantization_group_size
has been fixed.
NPU Device Plugin#
Remote Tensor API is now supported.
You can now query the available number of tiles (ov::intel_npu::max_tiles) and force a specific number of tiles to be used by the model, per inference request (ov::intel_npu::tiles). Note: ov::intel_npu::tiles overrides the default number of tiles selected by the compiler based on performance hints (ov::hint::performance_mode). Any tile number other than 1 may be a problem for cross platform compatibility, if not tested explicitly versus the max_tiles value.
You can now bypass the model caching mechanism in the driver (ov::intel_npu::bypass_umd_caching). Read more about driver and OpenVINO caching.
Memory footprint at model execution has been reduced by one blob (compiled model) size. For execution, the plugin no longer retrieves the compiled model from the driver, it uses the level zero graph handle directly, instead. The compiled model is now retrieved from the driver only during the export method.
OpenVINO Python API#
Openvino.Tensor, when created in the shared memory mode, now prevents “garbage collection” of numpy memory.
The
openvino.experimental
submodule is now available, providing access to experimental functionalities under development.New python-exclusive openvino.Model constructors have been added.
Image padding in PreProcessor is now available.
OpenVINO Runtime is now compatible with numpy 2.0.
OpenVINO Node.js API#
The following has been improved
Unit tests for increased efficiency and stability
Security updates applied to dependencies
Electron compatibility is now confirmed with new end-to-end tests.
New API methods added.
TensorFlow Framework Support#
TensorFlow 2.17.0 is now supported.
JAX 0.4.31 is now supported via a path of jax2tf with native_serialization=False
8 NEW* operations have been added.
Tensor lists with multiple undefined dimensions in element_shape are now supported, enabling support for TF Hub lite0-detection/versions/1 model.
PyTorch Framework Support#
Torch 2.4 is now supported.
Inplace ops are now supported automatically if the regular version is supported.
Symmetric GPTQ model from Hugging Face will now be automatically converted to the signed type (INT4) and zero-points will be removed.
ONNX Framework Support#
ONNX 1.16.0 is now supported
models with constants/inputs of uint4/int4 types are now supported.
4 NEW operations have been added.
OpenVINO Model Server#
OpenAI API for text generation is now officially supported and recommended for production usage. It comes with the following new features:
Prefix caching feature, caching the prompt evaluation to speed up text generation.
Ability to compress the KV Cache to a lower precision, reducing memory consumption without a significant loss of accuracy.
stop
sampling parameters, to define a sequence that stops text generation.logprobs
sampling parameter, returning the probabilities to returned tokens.Generic metrics related to execution of the MediaPipe graph that can be used for autoscaling based on the current load and the level of concurrency.
Demo of text generation horizontal scalability using basic docker containers and Kubernetes.
Automatic cancelling of text generation for disconnected clients.
Non-UTF-8 responses from the model can be now automatically changed to Unicode replacement characters, due to their configurable handling.
Intel GPU with paged attention is now supported.
Support for Llama3.1 models.
The following has been improved:
Handling of model templates without bos_token is now fixed.
Performance of the multinomial sampling algorithm.
finish_reason
in the response correctly determines reaching max_tokens (length) and completing the sequence (stop).Security and stability.
Neural Network Compression Framework#
The LoRA Correction algorithm is now included in the Weight Compression method, improving the accuracy of INT4-compressed models on top of other data-aware algorithms, such as AWQ and Scale Estimation. To enable it, set the lora_correction option to True in nncf.compress_weights().
The GPTQ compression algorithm can now be combined with the Scale Estimation algorithm, making it possible to run GPTQ, AWQ, and Scale Estimation together, for the optimum-accuracy INT4-compressed models.
INT8 quantization of LSTMSequence and Convolution operations for constant inputs is now enabled, resulting in better performance and reduced model size.
OpenVINO Tokenizers#
Split and BPE tokenization operations have been reimplemented, resulting in improved tokenization accuracy and performance.
New building options are now available, offering up to a 12x reduction in binary size.
An operation is now available to validate and skip/replace model-generated non-Unicode bytecode sequences during detokenization.
OpenVINO.GenAI#
New samples and pipelines are now available:
An example IterableStreamer implementation in multinomial_causal_lm/python sample
GenAI compilation is now available as part of OpenVINO via the –DOPENVINO_EXTRA_MODULES CMake option.
Other Changes and Known Issues#
Jupyter Notebooks#
The list of supported models in LLM chatbot now includes Phi3.5, Gemma2 support
Known Issues#
Previous 2024 releases#
Deprecation And Support#
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to the OpenVINO Legacy Features and Components page.
Discontinued in 2024#
Runtime components:
Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond.
OpenVINO C++/C/Python 1.0 APIs (see 2023.3 API transition guide for reference).
All ONNX Frontend legacy API (known as ONNX_IMPORTER_API).
PerfomanceMode.UNDEFINED
property as part of the OpenVINO Python API.
Tools:
Deployment Manager. See installation and deployment guides for current distribution options.
Post-Training Optimization Tool (POT). Neural Network Compression Framework (NNCF) should be used instead.
A Git patch for NNCF integration with huggingface/transformers. The recommended approach is to use huggingface/optimum-intel for applying NNCF optimization on top of models from Hugging Face.
Support for Apache MXNet, Caffe, and Kaldi model formats. Conversion to ONNX may be used as a solution.
Deprecated and to be removed in the future#
The macOS x86_64 debug bins will no longer be provided with the OpenVINO toolkit, starting with OpenVINO 2024.5.
Python 3.8 is now considered deprecated, and it will not be available beyond the 2024.4 OpenVINO version.
dKMB support is now considered deprecated and will be fully removed with OpenVINO 2024.5
Intel® Streaming SIMD Extensions (Intel® SSE) will be supported in source code form, but not enabled in the binary package by default, starting with OpenVINO 2025.0
The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in Release Policy.
The OpenVINO™ Development Tools package (pip install openvino-dev) will be removed from installation options and distribution channels beginning with OpenVINO 2025.0.
Model Optimizer will be discontinued with OpenVINO 2025.0. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. It will be replaced with CPU binding configurations (
ov::hint::enable_cpu_pinning
).OpenVINO Model Server components:
“auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.
A number of notebooks have been deprecated. For an up-to-date listing of available notebooks, refer to the OpenVINO™ Notebook index (openvinotoolkit.github.io).
Legal Information#
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein.
You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2024, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.