OpenVINO Release Notes#
2025.1 - 09 April 2025#
System Requirements | Release policy | Installation Guides
What’s new#
More Gen AI coverage and frameworks integrations to minimize code changes
New models supported: Phi-4 Mini, Jina CLIP v1, and Bce Embedding Base v1.
OpenVINO™ Model Server now supports VLM models, including Qwen2-VL, Phi-3.5-Vision, and InternVL2.
OpenVINO GenAI now includes image-to-image and inpainting features for transformer-based pipelines, such as Flux.1 and Stable Diffusion 3 models, enhancing their ability to generate more realistic content.
Preview: AI Playground now utilizes the OpenVINO Gen AI backend to enable highly optimized inferencing performance on AI PCs.
Broader LLM model support and more model compression techniques
Reduced binary size through optimization of the CPU plugin and removal of the GEMM kernel.
Optimization of new kernels for the GPU plugin significantly boosts the performance of Long Short-Term Memory (LSTM) models, used in many applications, including speech recognition, language modeling, and time series forecasting.
Preview: Token Eviction implemented in OpenVINO GenAI to reduce the memory consumption of KV Cache by eliminating unimportant tokens. This current Token Eviction implementation is beneficial for tasks where a long sequence is generated, such as chatbots and code generation.
NPU acceleration for text generation is now enabled in OpenVINO™ Runtime and OpenVINO™ Model Server to support the power-efficient deployment of VLM models on NPUs for AI PC use cases with low concurrency.
More portability and performance to run AI at the edge, in the cloud or locally
Support for the latest Intel® Core™ processors (Series 2, formerly codenamed Bartlett Lake), Intel® Core™ 3 Processor N-series and Intel® Processor N-series (formerly codenamed Twin Lake) on Windows.
Additional LLM performance optimizations on Intel® Core™ Ultra 200H series processors for improved 2nd token latency on Windows and Linux.
Enhanced performance and efficient resource utilization with the implementation of Paged Attention and Continuous Batching by default in the GPU plugin.
Preview: The new OpenVINO backend for ExecuTorch will enable accelerated inference and improved performance on Intel hardware, including CPUs, GPUs, and NPUs.
OpenVINO™ Runtime#
Common#
Delayed weight compression is now available - compressed weights are not stored in memory but saved to a file immediately after compression to control memory consumption.
Extensions can now be registered per frontend (an update to the extension API).
Memory-mapped (mmap) tensors have been added, enabling an ov::Tensor to be read from a file on disk using mmap. This helps reduce memory consumption in some scenarios, for example when using LoRA adapters in GenAI.
CPU Device Plugin#
Dynamic quantization of Fully Connected layers with asymmetric weights is now enabled on Intel AVX2 platforms, improving out-of-the-box performance for 8bit/4bit asymmetric weight-compressed LLMs.
Performance of weight-compressed LLMs for long prompts has been optimized on Intel client and Xeon platforms, especially 1st-token latency.
Optimization of QKV (Query, Key, and Value) projection and MLP (Multilayer Perceptrons) fusion for LLMs has been extended to support BF16 on Windows OS for performance improvements on AMX platforms.
GEMM kernel has been removed from the OpenVINO CPU library, reducing its size.
FP8 (alias for f8e4m3 and f8e5m2) model support has been enhanced with optimized FakeConvert operator. Compilation time for FP8 LLMs has also been improved.
GPU Device Plugin#
Second token latency of large language models has been improved on all GPU platforms with optimization of translation lookaside buffer (TLB) scenario and Group Query Attention (GQA).
First token latency of large language models has been improved on Intel Core Ultra Processors Series 2 with Paged Attention optimization.
Int8 compressed KV-cache is enabled for LLMs by default on all GPU platforms.
Performance of VLM (visual language models) has been improved on GPU platforms with XMX (Xe Matrix eXtensions).
NPU Device Plugin#
Support for LLM weightless caching and encryption of LLM blobs.
When a model is imported from the cache, you can now use ov::internal::cached_model_buffer to reduce the memory footprint.
NF4 (4-bit NormalFloat) inputs/outputs are now supported. End-to-end support depends on the driver version.
The following issues have been fixed:
For stateful models, the Level Zero command list is now updated when a tensor is relocated.
The zeContextDestroy error that occurred when applications were using static ov::Core objects.
OpenVINO Python API#
Ability to create a Tensor directly from a Pillow image, eliminating the need for casting it to a NumPy array first; see the sketch below.
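A minimal sketch of the new constructor path, assuming the 2025.1 Python wheel and a local input.jpg file (path and image are placeholders for illustration):

```python
# Sketch: creating an ov.Tensor straight from a Pillow image, without an explicit np.asarray step.
from PIL import Image
import openvino as ov

image = Image.open("input.jpg")
tensor = ov.Tensor(image)        # previously: ov.Tensor(np.asarray(image))
print(tensor.shape, tensor.element_type)
```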
Optimization of memory consumption for export_model, read_model, and compile_model methods.
OpenVINO Node.js API#
Node.js bindings for OpenVINO GenAI are now available in the genai-node npm package and bring the simplicity of OpenVINO GenAI API to Node.js applications.
PyTorch Framework Support#
PyTorch version 2.6 is now supported.
Common translators have been implemented to unify decompositions for operations of multiple frameworks (PyTorch, TensorFlow, ONNX, JAX) and to support complex tensors.
FP8 model conversion is now supported.
Conversion of TTS models containing STFT/ISTFT operators has been enabled.
JAX Framework Support#
JAX 0.5.2 and Flax 0.10.4 have been added to validation.
Keras 3 Multi-backend Framework Support#
Keras 3.9.0 is now supported.
A more granular test exclusion mechanism has been provided, making it convenient to enable tests per operation.
TensorFlow Lite Framework Support#
Support has been enabled for models that use quantized tensors between layers at runtime.
OpenVINO Model Server#
Major new features:
VLM support with continuous batching - the chat/completions endpoint has been extended to support vision models. It is now possible to send images in the chat context. Vision models can be deployed like LLM models; see the client sketch below.
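A minimal client sketch for the extended endpoint, assuming an OVMS instance on localhost:8000 serving a VLM under the name Qwen2-VL; the host, port, model name, and /v3 path are illustrative assumptions, and the image is passed as a base64 data URI (see the known limitation on URL links below):

```python
# Sketch: sending an image to the OVMS chat/completions endpoint in OpenAI-style format.
import base64
import requests

with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Qwen2-VL",
    "max_tokens": 128,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this picture?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```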
NPU acceleration for text generation - it is now possible to deploy LLM and VLM models on the NPU accelerator. Text generation is exposed over the completions and chat/completions endpoints. From the client perspective, it works the same way as GPU and CPU deployments; however, it does not use the continuous batching algorithm and targets AI PC use cases with low concurrency.
Other improvements
Model management improvements - MediaPipe graphs and generative endpoints can now be started using command-line parameters alone, without a configuration file. The JSON structure of the configuration file for models and graphs has been unified under the models_config_list section.
The scalability demonstration using multiple instances has been updated; see the demo.
The allowed number of stop words in a request has been increased from 4 to 16.
Integration with the Continue extension for Visual Studio Code has been enabled, making it possible to use the assistance of a local AI service while writing code.
Performance improvements - enhancements in OpenVINO Runtime and in the text sampling generation algorithm should increase throughput in high-concurrency load scenarios.
Breaking changes
The gRPC server is now optional and there is no default gRPC port set. The --port parameter is mandatory to start the gRPC server; the REST API server can be started with only the --rest_port parameter. At least one port number (--port or --rest_port) must be defined to start the OVMS server from the CLI. Starting the OVMS server via C API calls does not require any port to be defined.
The following issues have been fixed:
Handling of the LLM context length - OVMS now stops generating text when the model context is exceeded. An error is raised when the prompt is longer than the context or when max_tokens plus the input tokens exceeds the model context. In addition, it is possible to constrain the maximum number of generated tokens for all users of the model.
Security and stability improvements.
Cancellation of LLM generation without streaming.
Known limitations
The chat/completions endpoint accepts images encoded in base64 format, but not as URL links.
Neural Network Compression Framework#
Preview support for Quantization-Aware Training (QAT) with LoRA adapters for more accurate 4-bit weight compression of LLMs in PyTorch. The nncf.compress_weights API has been extended with a new compression_format option, CompressionFormat.FQ_LORA, for this QAT method. To see how it works, see the sample and the sketch below.
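A minimal sketch of requesting the new format, assuming CompressionFormat is exposed at the top-level nncf namespace and that a data-free call is enough for illustration (a real QAT setup may also require a dataset and a fine-tuning loop); the toy module, mode, and group size are illustrative stand-ins for an LLM configuration:

```python
# Sketch: 4-bit weight compression emitted in the FQ_LORA format for later QAT with LoRA adapters.
import torch
import nncf

# A toy module stands in for a real PyTorch LLM.
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64))

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,            # 4-bit weights
    group_size=-1,                                       # per-channel groups for the tiny layers
    compression_format=nncf.CompressionFormat.FQ_LORA,   # the new QAT-with-LoRA format
)
```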
Activation-aware Weight Quantization and Scale Estimation data-aware 4-bit compression methods have been added for the PyTorch backend. Compression of LLMs can now be applied directly to PyTorch models to speed up the process.
Reduced Generative Pre-trained Transformers Quantization (GPTQ) compression time and peak memory usage.
Reduced compression time and peak memory usage of data-free mixed precision weight compression.
New tracing for PyTorch models based on TorchFunctionMode is available for nncf.quantize and nncf.compress_weights; it does not require torch namespace fixes. It is disabled by default and can be enabled with the environment variable NNCF_EXPERIMENTAL_TORCH_TRACING=1, as in the sketch below.
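A minimal sketch of opting in; the only assumption beyond the release note is that the variable should be set before NNCF is imported, and the toy module stands in for a real model:

```python
# Sketch: enabling the experimental TorchFunctionMode-based tracing in NNCF.
import os
os.environ["NNCF_EXPERIMENTAL_TORCH_TRACING"] = "1"   # set before importing nncf (assumption)

import torch
import nncf

# Subsequent nncf.compress_weights()/nncf.quantize() calls use the new tracing path.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
compressed_model = nncf.compress_weights(model)
```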
Multiple improvements in the TorchFX backend to comply with the Torch AO guidelines:
The constant folding pass has been removed from the OpenVINO Quantizer and the quantize_pt2e function.
Support for dynamic shape TorchFX models.
Initial steps to adopt custom quantizers in quantize_pt2e within NNCF:
The hardware configuration is generalized with the narrow_range parameter.
The quantizer parameter calculation code is refactored to explicitly depend on narrow_range.
Preview support of the OpenVINO backend in ExecuTorch has been introduced; model quantization is implemented via the nncf.experimental.torch.fx.quantize_pt2e function.
PyTorch version 2.6 is now supported.
OpenVINO Tokenizers#
Support for Unigram tokenization models.
OpenVINO Tokenizers can now be built with an installed ICU (International Components for Unicode) plugin for reduced binary size.
The max_length and padding rule parameters can be dynamically adjusted with the Tokenizer class from OpenVINO GenAI; see the sketch below.
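A minimal sketch, assuming a converted model folder model_dir; max_length comes from the release note, while the pad_to_max_length keyword name is an assumption made for illustration:

```python
# Sketch: adjusting tokenization length and padding at encode time.
import openvino_genai as ov_genai

tokenizer = ov_genai.Tokenizer("model_dir")
encoded = tokenizer.encode(
    ["What is OpenVINO?"],
    max_length=64,
    pad_to_max_length=True,   # assumed keyword name for the padding rule
)
print(encoded.input_ids.shape)
```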
The fast_tokenizer dependency has been removed; the core_tokenizers binary is no longer part of the OpenVINO Tokenizers distribution.
OpenVINO.GenAI#
The following has been added:
Preview support for the Token Eviction mechanism for more efficient KVCache memory management of LLMs during text generation. Disabled by default. See the sample.
LLMPipeline C bindings and JavaScript bindings.
StreamerBase::write(int64_t token) and StreamerBase::write(const std::vector<int64_t>& tokens).
Phi-3-vision-128k-instruct and Phi-3.5-vision-instruct support for VLMPipeline.
Image-to-image and inpainting pipelines that support FLUX and Stable Diffusion 3.
LLMPipeline now uses Paged Attention backend by default.
Streaming is now performed in a separate thread while the next token is being inferred by LLM.
The chat template is applied even when chat mode is disabled. Use the apply_chat_template flag in GenerationConfig to disable the chat template, as in the sketch below.
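A minimal sketch, assuming a converted model in model_dir and the CPU device; only the apply_chat_template field is taken from the release note:

```python
# Sketch: generating from a raw prompt with the chat template disabled.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("model_dir", "CPU")
config = ov_genai.GenerationConfig()
config.apply_chat_template = False   # do not wrap the prompt with the model's chat template
config.max_new_tokens = 64
print(pipe.generate("Continue this sentence: OpenVINO is", config))
```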
Time-consuming methods now release the Global Interpreter Lock (GIL).
Other Changes and Known Issues#
Windows PDB Archives: Archives containing PDB files for Windows packages are now available. You can find them right next to the regular archives, in the same folder.
Jupyter Notebooks#
Known Issues#
A known NPU issue on Linux can be worked around by reloading the driver: run sudo rmmod intel_vpu followed by sudo modprobe intel_vpu. A rollback to an earlier version of the Linux NPU driver will also work.
When the add_request() and step() APIs are used in multiple threads, the resulting text is not correct.
Deprecation And Support#
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using discontinued features, you will have to revert to the last LTS OpenVINO version supporting them. For more details, refer to: OpenVINO Legacy Features and Components.
Discontinued in 2025#
Runtime components:
The OpenVINO property of the Affinity API is no longer available. It has been replaced with CPU binding configurations (ov::hint::enable_cpu_pinning); see the sketch below.
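A minimal Python sketch of the replacement configuration, assuming a model file model.xml and the CPU device:

```python
# Sketch: requesting CPU pinning via the hint property instead of the removed Affinity API.
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
compiled_model = core.compile_model("model.xml", "CPU", {hints.enable_cpu_pinning: True})
```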
The openvino-nightly PyPI module has been discontinued. End-users should proceed with the Simple PyPI nightly repo instead. More information in Release Policy.
Tools:
The OpenVINO™ Development Tools package (pip install openvino-dev) is no longer available for OpenVINO releases in 2025.
Model Optimizer is no longer available. Consider using the new conversion methods instead. For more details, see the model conversion transition guide.
Intel® Streaming SIMD Extensions (Intel® SSE) are currently not enabled in the binary package by default. They are still supported in the source code form.
Legacy prefixes: l_, w_, and m_ have been removed from OpenVINO archive names.
OpenVINO GenAI:
StreamerBase::put(int64_t token) has been deprecated in favor of StreamerBase::write().
The Bool return value for the callback streamer is no longer accepted. It must now return one of the three values of the StreamingStatus enum.
ChunkStreamerBase is deprecated. Use StreamerBase instead.
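A minimal Python sketch of a custom streamer using the new interface, assuming the openvino_genai bindings expose write() and the StreamingStatus enum under the same names as the C++ API; the decoding step and usage lines are illustrative:

```python
# Sketch: a streamer built on the new write()/StreamingStatus interface (put() is deprecated).
import openvino_genai as ov_genai

class PrintStreamer(ov_genai.StreamerBase):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer

    def write(self, token):                        # replaces the deprecated put()
        print(self.tokenizer.decode([token]), end="", flush=True)
        return ov_genai.StreamingStatus.RUNNING    # RUNNING, STOP, or CANCEL

    def end(self):
        print()

# Usage (assumes a converted model in model_dir):
# pipe = ov_genai.LLMPipeline("model_dir", "CPU")
# pipe.generate("Hello", ov_genai.GenerationConfig(max_new_tokens=32), PrintStreamer(pipe.get_tokenizer()))
```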
NNCF:
The create_compressed_model() method is now deprecated. The nncf.quantize() method is recommended for Quantization-Aware Training of PyTorch and TensorFlow models.
OpenVINO Model Server (OVMS) benchmark client in C++ using TensorFlow Serving API.
Deprecated and to be removed in the future#
openvino.Type.undefined is now deprecated and will be removed with version 2026.0. openvino.Type.dynamic should be used instead.
APT & YUM Repositories Restructure: Starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like “2025”). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages.
OpenCV binaries will be removed from Docker images in 2026.
Ubuntu 20.04 support will be deprecated in future OpenVINO releases due to the end of standard support.
“auto shape” and “auto batch size” (reshaping a model in runtime) will be removed in the future. OpenVINO’s dynamic shape models are recommended instead.
macOS x86 is no longer recommended for use due to the discontinuation of validation. Full support will be removed later in 2025.
The openvino namespace of the OpenVINO Python API has been redesigned, removing the nested openvino.runtime module. The old namespace is now considered deprecated and will be discontinued in 2026.0.
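A minimal before/after sketch of the import change:

```python
# Deprecated (to be discontinued in 2026.0):
#   from openvino.runtime import Core

# Recommended:
import openvino as ov

core = ov.Core()
print(core.available_devices)
```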
Legal Information#
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Copyright © 2025, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.