OpenVINO Release Notes#
2025.2 - 18 June 2025#
System Requirements | Release policy | Installation Guides
What’s new#
More Gen AI coverage and frameworks integrations to minimize code changes
New models supported on CPUs & GPUs: Phi-4, Mistral-7B-Instruct-v0.3, SD-XL Inpainting 0.1, Stable Diffusion 3.5 Large Turbo, Phi-4-reasoning, Qwen3, and Qwen2.5-VL-3B-Instruct. Mistral 7B Instruct v0.3 is also supported on NPUs.
Preview: OpenVINO™ GenAI introduces a text-to-speech pipeline for the SpeechT5 TTS model, while the new RAG backend offers developers a simplified API that delivers reduced memory usage and improved performance.
Preview: OpenVINO™ GenAI offers a GGUF Reader for seamless integration of llama.cpp based LLMs, with Python and C++ pipelines that load GGUF models, build OpenVINO graphs, and run GPU inference on-the-fly. Validated for popular models: DeepSeek-R1-Distill-Qwen (1.5B, 7B), Qwen2.5 Instruct (1.5B, 3B, 7B) & llama-3.2 Instruct (1B, 3B, 8B).
Broader LLM model support and more model compression techniques
Further optimization of LoRA adapters in OpenVINO GenAI for improved LLM, VLM, and text-to-image model performance on built-in GPUs. Developers can use LoRA adapters to quickly customize models for specialized tasks.
KV cache compression for CPUs is enabled by default for INT8, providing a reduced memory footprint while maintaining accuracy compared to FP16. Additionally, it delivers substantial memory savings for LLMs with INT4 support compared to INT8.
Optimizations for Intel® Core™ Ultra Processor Series 2 built-in GPUs and Intel® Arc™ B Series Graphics with the Intel® XMX systolic platform to enhance the performance of VLM models and hybrid quantized image generation models, as well as improve first-token latency for LLMs through dynamic quantization.
More portability and performance to run AI at the edge, in the cloud or locally
Enhanced Linux* support with the latest GPU driver for built-in GPUs on Intel® Core™ Ultra Processor Series 2 (formerly codenamed Arrow Lake H).
OpenVINO™ Model Server now offers a streamlined C++ version for Windows, improved performance for long-context models through prefix caching, and a smaller Windows package that eliminates the Python dependency. Support for Hugging Face models is now included.
Support for INT4 data-free weights compression for ONNX models implemented in the Neural Network Compression Framework (NNCF).
NPU support for FP16-NF4 precision on Intel® Core™ 200V Series processors for models with up to 8B parameters is enabled through symmetrical and channel-wise quantization, improving accuracy while maintaining performance efficiency.
OpenVINO™ Runtime#
Common#
Better developer experience with shorter build times, due to optimizations and source code refactoring. Code readability has been improved, helping developers understand which components are included across different C++ files.
Memory consumption has been optimized by expanding the usage of mmap for the GenAI component and introducing the delayed constant weights mechanism.
Support for the ISTFT operator on GPU has been expanded, improving support for text-to-speech, speech-to-text, and speech-to-speech models such as AudioShake and Kokoro.
Models like Behavior Sequence Transformer are now supported, thanks to SparseFillEmptyRows and SegmentMax operators.
google/fnet-base, tf/InstaNet, and more models are now enabled, thanks to DFT operators (discrete Fourier transform) supporting dynamism.
The “COMPILED_BLOB” hint property is now available to speed up model compilation. The “COMPILED_BLOB” can be a regular or weightless model. For weightless models, the “WEIGHT_PATH” hint provides the location of the model weights (see the sketch at the end of this section).
Reading tensor data from file as copy or using mmap feature is now available.
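The new hints can be combined with the existing blob export flow. Below is a minimal Python sketch, assuming the properties are exposed to the Python API as the string keys "COMPILED_BLOB" and "WEIGHT_PATH", that the blob is passed as an ov.Tensor, and that the device and file names are placeholders.

```python
import numpy as np
import openvino as ov

core = ov.Core()

# First run: compile once and export the blob with the standard API.
compiled = core.compile_model("model.xml", "NPU")
with open("model.blob", "wb") as f:
    f.write(compiled.export_model())

# Later runs: hand the precompiled blob back to the plugin through the
# new hint (assumption: the key is "COMPILED_BLOB" and it takes a Tensor).
blob = ov.Tensor(np.fromfile("model.blob", dtype=np.uint8))
compiled_fast = core.compile_model(
    "model.xml", "NPU",
    {
        "COMPILED_BLOB": blob,
        # For a weightless blob, point the plugin at the weights file
        # (assumption: "WEIGHT_PATH" takes a plain path string).
        "WEIGHT_PATH": "model.bin",
    },
)
```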
AUTO Inference Mode#
Memory footprint in model caching has been reduced by loading the model only for the selected plugin, avoiding duplicate model objects.
CPU Device Plugin#
Per-channel INT8 KV cache compression is now enabled by default, helping LLMs maintain accuracy while reducing memory consumption.
Per-channel INT4 KV cache compression is supported and can be enabled using the KEY_CACHE_PRECISION and KEY_CACHE_QUANT_MODE properties (see the sketch at the end of this section). Some models may be sensitive to INT4 KV cache compression.
Performance of encoder-based LLMs has been improved through additional graph-level optimizations, including QKV (Query, Key, and Value) projection and Multi-Head Attention (MHA).
SnapKV support has been implemented in the CPU plugin to reduce KV cache size while maintaining comparable performance. It calculates attention scores in PagedAttention for both prefill and decode stages. This feature is enabled by default in OpenVINO GenAI when KV cache eviction is used.
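As an illustration of the per-channel INT4 option above, here is a minimal sketch that passes the properties to the CPU plugin as the string keys named in these notes; the value spellings ("u4", "BY_CHANNEL") and the model file name are assumptions and may differ between builds.

```python
import openvino as ov

core = ov.Core()
model = core.read_model("llm.xml")

# Enable per-channel INT4 KV cache compression on CPU.
# Note: some models may lose accuracy with INT4 KV cache compression.
compiled = core.compile_model(
    model,
    "CPU",
    {
        "KEY_CACHE_PRECISION": "u4",           # assumed value spelling
        "KEY_CACHE_QUANT_MODE": "BY_CHANNEL",  # assumed value spelling
    },
)
```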
GPU Device Plugin#
Performance of generative models (e.g. large language models, visual language models, image generation models) has been improved on XMX-based platforms (Intel® Core™ Ultra Processor Series 2 built-in GPUs and Intel® Arc™ B Series Graphics) with dynamic quantization and optimization in GEMM and Convolution.
2nd token latency of INT4 generative models has been improved on Intel® Core™ Processors, Series 1.
LoRA support has been optimized for Intel® Core™ Processor GPUs, and its memory footprint has been reduced by optimizing the OPS node dependencies.
SnapKV cache rotation now supports accurate token eviction through re-rotation of cache segments that change position after token eviction.
KV cache compression is now available for systolic platforms with an update to micro kernel implementation.
Improvements to Paged Attention performance and functionality have been made, with support for different head sizes for Key and Value in KV cache inputs.
NPU Device Plugin#
The NPU Plugin can now retrieve options from the compiler and mark only the corresponding OpenVINO properties as supported.
The model import path now supports passing precompiled models directly to the plugin using the ov::compiled_blob property (Tensor), removing the need for stream access.
The ov::intel_npu::turbo property is now forwarded both to the compiler and the driver when supported. Using NPU_TURBO may result in longer compile time, increased memory footprint, changes in workload latency, and compatibility issues with older NPU drivers.
The same Level Zero context is now used across OpenVINO Cores, enabling remote tensors created through one Core object to be used with inference requests created with another Core object.
BlobContainer has been replaced with regular OpenVINO tensors, simplifying the underlying container for a compiled blob.
Weightless caching and compilation for LLMs are now available when used with OpenVINO GenAI.
LLM accuracy issues with BF16 models have been resolved.
The NPU driver is now included in OpenVINO Docker images for Ubuntu, enabling out-of-the-box NPU support without manual driver installation. For instructions, refer to the OpenVINO Docker documentation.
NPU support for FP16-NF4 precision on Intel® Core™ 200V Series processors for models with up to 8B parameters is enabled through symmetrical and channel-wise quantization, improving accuracy while maintaining performance efficiency. FP16-NF4 is not supported on CPUs and GPUs.
OpenVINO Python API#
The wheel package and source code now include type hinting support (.pyi files) to help Python developers work in IDEs. By default, .pyi files are generated automatically, but developers can also trigger generation manually.
The compiled_blob property has been added to improve work with compiled blobs for NPU.
OpenVINO C API#
A new API function is now available for reading IR models directly from memory.
OpenVINO Node.js API#
The OpenVINO GenAI JS package API has been expanded for compliance with future LangChain.js user requirements (as defined by the LangChain adapter definition).
A new sample has been added, demonstrating OpenVINO GenAI in JS.
PyTorch Framework Support#
Complex numbers in the RoPE pattern, used in Wan2.1 model, are now supported.
OpenVINO™ Model Server#
Major new features:
Image generation endpoint - this preview feature enables image generation based on text prompts. The endpoint is compatible with the OpenAI API, making it easy to integrate with the existing ecosystem (see the client sketch after this list).
Agentic AI enablement via support for tools in LLM models. This preview feature allows easy integration of OpenVINO serving with AI Agents.
Model management via OVMS CLI now includes automatic download of OpenVINO models from Hugging Face Hub. This makes it possible to deploy generative pipelines with just a single command and manage the models without extra scripts or manual steps.
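A minimal client-side sketch of the preview image generation endpoint, assuming OVMS is already running locally with an image generation model loaded; the base URL, port, model name, and response field are assumptions that follow the OpenAI images API shape.

```python
import base64
from openai import OpenAI

# OVMS exposes an OpenAI-compatible REST API; the endpoint details below
# are placeholders for illustration only.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

result = client.images.generate(
    model="stable-diffusion",  # placeholder model name
    prompt="a watercolor lighthouse at dawn",
)
with open("output.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```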
Other improvements:
VLM models served through the chat/completions endpoint now support passing images as a URL or as a path on the local file system (see the client sketch below).
An option to use the C++-only server version with support for LLM models is now available. This smaller deployment package can be used for both the completions and chat/completions endpoints.
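For the image-as-URL option above, a minimal sketch using the OpenAI Python client against a locally running OVMS instance; the port, model name, and image URL are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

resp = client.chat.completions.create(
    model="openvino-vlm",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this picture."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```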
The following issues have been fixed:
The correct error status is now reported in streaming mode.
Known limitations:
VLM models QwenVL2, QwenVL2.5, and Phi3_VL have low accuracy when deployed in a text generation pipeline with continuous batching. It is recommended to deploy these models in a stateful pipeline, which processes requests serially.
Neural Network Compression Framework#
A data-free AWQ (Activation-aware Weight Quantization) method for 4-bit weight compression, nncf.compress_weights(), is now available for OpenVINO models. It is now possible to compress weights to 4-bit with AWQ even without a dataset (see the sketch at the end of this section).
8-bit and 4-bit data-free weight compression, nncf.compress_weights(), is now available for models in ONNX format. See example.
4-bit data-aware AWQ (Activation-aware Weight Quantization) and Scale Estimation methods are now available for models in the TorchFX format.
TorchFunctionMode-based model tracing is now enabled by default for PyTorch models in nncf.quantize() and nncf.compress_weights().
Neural Low-Rank Adapter Search (NLS) Quantization-Aware Training (QAT) for more accurate 4-bit compression of LLMs on downstream tasks is now available. See example.
Weight compression time for NF4 data type has been reduced.
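A minimal sketch of the data-free 4-bit AWQ flow mentioned above, assuming the awq flag of nncf.compress_weights() can be used without a calibration dataset; the compression mode and file names are placeholders.

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")

# Data-free 4-bit weight compression with AWQ (no dataset argument).
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # placeholder mode
    awq=True,
)
ov.save_model(compressed, "model_int4.xml")
```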
OpenVINO Tokenizers#
Regex-based normalization and split operations have been optimized, resulting in significant speed improvements, especially for long input strings.
Two-string inputs are now supported, enabling various tasks, including RAG reranking.
Sentencepiece char-level tokenizers are now supported to enhance the SpeechT5 TTS model.
The tokenization node factory has been exposed to enable OpenVINO GenAI GGUF support.
OpenVINO GenAI#
New preview pipelines with C++ and Python samples have been added:
Text2SpeechPipeline,
TextEmbeddingPipeline covering RAG scenario.
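A minimal sketch of the new TextEmbeddingPipeline preview for RAG, assuming the Python bindings expose embed_documents() and embed_query() as in the samples; the model folder, device, and method names are assumptions.

```python
import openvino_genai as ov_genai

# Path to an exported embedding model folder (placeholder).
pipe = ov_genai.TextEmbeddingPipeline("bge-small-en-v1.5-ov", "CPU")

# Embed a small document set and a query for a RAG retriever.
doc_vectors = pipe.embed_documents(
    ["OpenVINO is an inference toolkit.", "GenAI adds ready-made pipelines."]
)
query_vector = pipe.embed_query("What does GenAI add?")
print(len(doc_vectors), len(query_vector))
```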
Visual language modeling (VLMPipeline):
A VLM prompt can now refer to specific images. For example,
<ov_genai_image_0>What’s in the image?
will prepend the corresponding image to the prompt while ignoring other images. See VLMPipeline’s docstrings for more details and the sketch after this list.
VLM uses continuous batching by default, improving performance.
VLMPipeline can now be constructed from in-memory ov::Model.
Qwen2.5-VL support has been added.
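A minimal sketch of referencing a specific image in a VLM prompt, assuming the Python VLMPipeline accepts a list of image tensors through the images argument of generate(); the model folder, image files, and tensor layout are assumptions.

```python
import numpy as np
from PIL import Image
import openvino as ov
import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline("Qwen2.5-VL-3B-Instruct-ov", "GPU")  # placeholder path

# Load two images as uint8 tensors (assumed HWC layout).
images = [ov.Tensor(np.array(Image.open(p))) for p in ("first.png", "second.png")]

# The <ov_genai_image_0> tag prepends only the first image; the second is ignored.
out = pipe.generate(
    "<ov_genai_image_0>What's in the image?",
    images=images,
    max_new_tokens=64,
)
print(out)
```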
JavaScript:
JavaScript samples have been added: beam_search_causal_lm and multinomial_causal_lm.
An interruption option for LLMPipeline streaming has been introduced.
The following has been added:
cache encryption samples demonstrating how to encrypt OpenVINO’s cached compiled model,
LLM ReAct Agent sample capable of calling external functions during text generation,
SD3 LoRA Adapter support for Text2ImagePipeline,
ov::genai::Tokenizer::get_vocab() method for C++ and Python,
ov::Property as arguments to the ov_genai_llm_pipeline_create function for the C API,
support for the SnapKV method for more accurate KV cache eviction, enabled by default when KV cache eviction is used,
preview support for GGUF models (GGML Unified Format). See the OpenVINO blog for details.
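A minimal sketch of the GGUF preview, assuming LLMPipeline can be pointed directly at a .gguf file; the file name and device are placeholders.

```python
import openvino_genai as ov_genai

# Load a llama.cpp-style GGUF checkpoint directly (preview feature);
# the OpenVINO graph is built on the fly before inference.
pipe = ov_genai.LLMPipeline("qwen2.5-1.5b-instruct-q4_k_m.gguf", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))
```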
Other Changes and Known Issues#
Jupyter Notebooks#
Known Issues#
Deprecation And Support#
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. For more details, refer to: OpenVINO Legacy Features and Components.
Discontinued in 2025#
Runtime components:
The OpenVINO property of Affinity API is no longer available. It has been replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).
The openvino-nightly PyPI module has been discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in Release Policy.
Tools:
The OpenVINO™ Development Tools package (pip install openvino-dev) is no longer available for OpenVINO releases in 2025.
Model Optimizer is no longer available. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
Intel® Streaming SIMD Extensions (Intel® SSE) are currently not enabled in the binary package by default. They are still supported in the source code form.
Legacy prefixes: l_, w_, and m_ have been removed from OpenVINO archive names.
OpenVINO GenAI:
StreamerBase::put(int64_t token)
The Bool value for the Callback streamer is no longer accepted. It must now return one of three values of the StreamingStatus enum.
ChunkStreamerBase is deprecated. Use StreamerBase instead.
NNCF:
The create_compressed_model() method is now deprecated. The nncf.quantize() method is recommended for Quantization-Aware Training of PyTorch and TensorFlow models.
OpenVINO Model Server (OVMS):
The benchmark client in C++ using the TensorFlow Serving API.
Deprecated and to be removed in the future#
Python 3.9 is now deprecated and will be unavailable after OpenVINO version 2025.4.
openvino.Type.undefined is now deprecated and will be removed with version 2026.0. openvino.Type.dynamic should be used instead.
APT & YUM Repositories Restructure: Starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like “2025”). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages.
OpenCV binaries will be removed from Docker images in 2026.
Ubuntu 20.04 support will be deprecated in future OpenVINO releases due to the end of standard support.
“auto shape” and “auto batch size” (reshaping a model at runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.
MacOS x86 is no longer recommended for use due to the discontinuation of validation. Full support will be removed later in 2025.
The openvino namespace of the OpenVINO Python API has been redesigned, removing the nested openvino.runtime module. The old namespace is now considered deprecated and will be discontinued in 2026.0. A new namespace structure is available for immediate migration. Details will be provided through warnings and documentation.
Legal Information#
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Copyright © 2025, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.