OpenVINO Release Notes#

2026.2 - 28 May 2026#

System Requirements | Release policy | Installation Guides

What’s New#

More Gen AI coverage and frameworks integrations to minimize code changes
- New models supported: Gemma 4 E2B and Gemma 4 E4B
  - Only on CPUs & GPUs: Qwen3-Coder-Next, Qwen3.5, Qwen3.6, Trinity-mini, LFM2-24B-A2B, LFM2-8B-A1B, LFM2.5-350M
  - Only on CPUs: YOLO26
  - Only on GPUs: Gemma 4 31B and Gemma 4 26B-A4B
  - Extended to GPUs: GPT-OSS-120B
- Scaled Dot-Product Attention (SDPA) path support added for LFM2 models
- Support for Hugging Face Transformers v5.0, ensuring compatibility with the latest model architecture for enhanced interoperability.
Broader LLM model support and more model compression techniques
- OpenVINO™ GenAI introduces extension support for loading custom extension libraries and registering unsupported operations via the extensions property. This gives developers the flexibility to run models with custom ops that OpenVINO doesn’t support out of the box.
- INT4 KV-cache compression is enabled for GPUs, with substantial memory reduction when KV cache size is significant, such as with large input prompts exceeding 32K tokens.
- OpenVINO GenAI significantly reduces model loading times on GPU when using cache blobs — preventing bottlenecks for multi-stage AI pipelines, including agentic use cases that rely on multiple models.
- Optimized IR read mode with independently managed constant buffers to reduce peak memory usage by avoiding unnecessary duplication of weight data unless required for correctness (Linux support added in this release).
- Preview: Enhanced XAttention accuracy on CPUs and GPUs through by-channel INT8 KV-cache quantization (compared to by-token INT8 KV-cache), matching the default by-channel INT8 KV cache quantization when XAttention is not enabled.
More portability and performance to run AI at the edge, in the cloud or locally
- OpenVINO™ GenAI extends its JavaScript API to include a Text-to-Speech pipeline and VLM samples for browser and Node.js developers.
- OpenVINO™ Model Server extends tool-calling support to Qwen 3.5 and 3.6 models to enable agentic AI use cases.
- OpenVINO™ Model Server adds streaming transcription support for speech-to-text, reducing latency for real-time voice applications.
- Preview: Introducing OpenVINO Physical AI, a hardware-accelerated, production‑ready inferencing and deployment framework that standardizes how developers connect cameras, robots, models, and safety controls, reducing brittle custom harnesses and making complex systems easier to build, debug, and evolve on Intel platforms.

OpenVINO™ Runtime#

Common Plugin#

Filesystem path handling in the C++ API has been improved through internal frontend enhancements, eliminating platform-specific path conversion issues and reducing the risk of path-related errors during model loading and deployment workflows.
Introduced properties RUNTIME_REQUIREMENTS and COMPATIBILITY_CHECK which allow users to check if compiled models can be imported by the device before sending them to the OpenVINO runtime.
Reduced peak and average memory consumption on model compilation when using mmap on Linux.
Improved model serialization error handling and report errors when there is no space on disk to store model.
The ov::save_model function now adds runtime attributes containing the OpenVINO version used for saving, enabling better model provenance tracking and version compatibility management.
Constant folding failures have been resolved when unsupported floating-point precision was encountered; operations now automatically fall back to FP32 precision to ensure successful model optimization.

CPU Device Plugin#

Enabled support for the new-generation Qwen3 series models, including Qwen3-Coder-Next, Qwen3.5 and Qwen3.6, with performance optimizations.
Added support and optimization for CausalConv1D and GatedDeltaNet operations with kernel implementation, enabling models such as Qwen3 and LFM2.

GPU Device Plugin#

Enabled support for the new-generation Qwen3 series models, including Qwen3-Coder-Next, Qwen3.5 and Qwen3.6, with performance optimizations.
Enabled parallel loading for model cache blobs, significantly reducing model load time.
Enabled by-channel INT8 KV cache quantization by default for GPU XAttention (sparse attention), delivering improved accuracy and aligning the configuration with the standard path.
INT4 KV cache quantization has been enabled, reducing memory consumption for key-value cache storage during LLM inference.
Improved Qwen3-vl-4b performance for TTFT, TPOT, and model loading.
The GPT-OSS-20B model now supports INT8 weight precision and runs on Intel® Core™ Ultra Series 2 processors (H-SKUs) and Intel® Arc™ GPUs.
Improved ResNet-34 performance on Intel® Xe2 architecture.

NPU Device Plugin#

Integrated NPU compiler version 8.1, splitting the compiler library into openvino_intel_npu_compiler_loader and openvino_intel_npu_compiler components. The loader library can return the list of supported properties and respond to queries about specific properties. The core compiler library is loaded in memory only when a model is compiled.
Added blob encryption and decryption support through ov::cache_encryption_callbacks, enabling secure model caching and export/import workflows. A security warning will be issued when using encryption callbacks with ‘Compiler-in-Driver’ on driver versions up to 32.0.100.4724 due to temporary unencrypted file usage.
Added support for ov::runtime_requirements to expose a human-readable compatibility description of the compiled model. Current support covers ‘Compiler-in-Plugin’ models, with ‘Compiler-in-Driver’ support planned for a future release.
Added support for ov::compatibility_check to check the compatibility of a compiled model based only on description (previously obtained through ov::runtime_requirements). Models are marked as supported in the current OpenVINO release if compiled for the current platform with sufficient device tiles available for execution. The compatibility check does not currently consider driver capabilities and cannot guarantee model acceptance during import. This logic will be adjusted in the upcoming release to rely on Level Zero (UMD) for more extensive compatibility checks when available in future drivers.
Added support for CompiledModel::release_memory(), enabling memory consumption reduction between inferences through NPU driver graph eviction mechanisms.
Implemented Level Zero command queue pooling to enable queue sharing across multiple compiled models, reducing doorbell utilization and recycling overhead.
Added model priority to be changed dynamically for compiled models through the ov::model_priority property after model compilation.
Exposed compiler version information through compiled model properties, providing better traceability for which compiler was automatically selected by the plugin during compilation and preserving this information in cached or exported models.
Added device recovery mechanisms to handle device lost errors, allowing applications to create new ov::Core instances after device resets to resume execution, aligning with future driver recovery capabilities.
Introduced general attention handling optimization, reducing time-to-first-token (TTFT) on most public LLMs for longer prompts (starting at 4K tokens). Use NPUW_PREFILL_ATTENTION_HINT:STATIC to revert to previous behavior if issues occur.
Improved compile time for large models, including LLMs.
Enhanced partitioning stability that previously caused compilation failures for several VLM models.
Enhanced support for GenAI models produced with Transformers 5.x.
Improved second token performance stability for LLMs on Intel® Core™ Ultra Series 2 processors.
Introduced new backward-compatible blob format, currently limited to classic (non-LLM) models. Blobs exported with NPUW_ENSURE_COMPATIBILITY:YES maintain compatibility with future OpenVINO versions for the given NPU architecture.

OpenVINO Node.js API#

Error handling in the Node.js API has been enhanced for Core and AsyncInferQueue classes, providing more robust and predictable exception management during model loading and inference operations.

PyTorch Framework Support#

GPTQ quantized model support has been added via torch.export, enabling conversion of 4-bit GPTQ models (AutoGPTQ) through the torch.export path.
Quantized model conversion with torch.export has been extended to support AWQ and BitNet quantized models in addition to existing TorchScript compatibility.
CELU operation accuracy has been fixed to match PyTorch behavior, resolving numerical precision issues.

ONNX Framework Support#

Scan, Loop, and If operations have been improved to correctly handle models using graph initializers as direct outputs of control flow subgraphs, resolving conversion failures and incorrect results. Scan operation now validates num_scan_inputs and properly handles models where the loop body has fewer outputs than initial state values.
Tokenizer operation support has been added via openvino-tokenizers integration, enabling conversion of ONNX models using StringNormalizer, LabelEncoder, Tokenizer, and TfIdfVectorizer operations when openvino-tokenizers is installed. Users receive guidance to install the package if missing.

OpenVINO™ Model Server#

Performance

Improved performance on Intel Data Center GPU Flex 60 and Flex 70 for Qwen3-30B MoE model family.
Improved multinomial algorithm performance, reducing latency for generation with temperature > 0.
Improved model loading and pipeline initialization performance for new inference requests.

New models and hardware support

Restored support for generative models on hosts with CPUs without Intel® AVX2 instruction set when using supported discrete GPUs.
Added support for Intel® Xe GPUs for MoE models, including Intel® Arc™ A770.
Enabled execution of GPT-OSS-20B with INT8 precision and GPT-OSS-120B with INT4 precision on GPU.
Enabled models and support for MoE for Qwen3.5, Qwen3.6, Qwen3-Coder-Next, and Gemma 4 (without continuous batching).
Fixed chat template rendering for DeepSeek and Granite models when processing non-ASCII characters.
Added tool parsers for Gemma 4 and LFM2 models.

Deployment ease

Improved default performance tuning to use resource constraints in Docker containers, with default number of REST workers, OpenVINO inference streams, threads, and CPU pinning configurations avoiding quota and ulimit settings on Linux to prevent overallocation and performance degradation in Docker and Kubernetes environments.
Enhanced deployment capabilities with local generative model startup options and runtime parameter configuration through CLI, enabling generative model deployment from read-only filesystems with configurable runtime parameters such as target device and cache size for seamless KServe and OpenShift integration.
Improved model pulling recovery mechanisms to resume interrupted Hugging Face model downloads from the previous checkpoint in case of failures or interruptions.

New or improved endpoints capabilities

Added initial support for /responses endpoint.
Fixed server readiness endpoint behavior - /v2/health/ready now correctly reports success when all models are fully initialized and returns appropriate errors when models are not loaded.
Added min_p sampling parameter for enhanced generation control.
Added skip_special_tokens sampling parameter - when set to False, returns raw model responses including special tokens to users.
Fixed default seed parameter to use random values, ensuring non-deterministic responses from LLM models.
Added LoRA adapter support for both image generation models and LLM models.
Added support of streaming for audio/transcriptions endpoint
Introduced OVMS_AUDIO_MAX_FILE_SIZE_BYTES environment variable that controls the upper bound on memory that a single audio request can allocate for decoded data.

Limitations:

Gemma 4 and LFM2 MoE models supported without Continuous Batching
/responses endpoint doesn’t include built-in tools, audio input and multinomial output. There are also no session management capabilities.

Neural Network Compression Framework#

Added support for transpose_a attribute in Scale Estimation compression algorithm.
Added INT4/INT8 weight compression support for Vision-Language-Action (VLA) with Pi0.5 model.

OpenVINO Tokenizers#

Added ONNX Frontend extension with new translators for tokenization-related operations: Label Encoder, StringNormalizer, Tokenizer, TFID Vectorizer.
Updated TensorFlow Frontend extension with AsString operation translator.
Extended Python CLI with new check and diagnose tools.
Reduced binary size on Linux and Mac platforms.

OpenVINO™ GenAI#

Support for hybrid attention models with linear state (CausalConv1D, GatedDeltaNet) for SDPA and PA backend.
VLM models enabled: Qwen3.5, Qwen3.6, Gemma 4 (SDPA backend), VideoChat-Flash.
Hybrid attention text models enabled: LFM2, Qwen3-Coder-Next.
VLMPipeline now supports in-pipeline video sampling with video metadata API, enabling raw video frames input directly without manual frame sampling.
Continuous Batching API now supports images_batches, videos_batches, videos_metadata_batches properties.
Multinomial sampling performance has been improved, and a new min_p sampling parameter has been added for enhanced generation control.
Whisper pipeline results now include a language field indicating the detected or specified language for audio transcription and translation tasks.
LoRA adapters can now be applied to Text2VideoPipeline.
TaylorSeer caching mechanism is now enabled by default for Flux, Stable Diffusion 3, and LTX-Video.
Added new TOOL_CALL : finish_reason for better integration with agentic tools.
Performance metrics now include apply_chat_template() latency measurements for comprehensive chat pipeline profiling.
A new extensions API has been added, enabling direct loading of OpenVINO extensions within GenAI pipelines for custom operation support.
The Node.js API has been updated to include support for Text2SpeechPipeline and Text2ImagePipeline, enabling speech and image generation.
Whisper pipeline on NPU now supports word-level timestamps by default.

OpenVINO™ Physical AI#

Released OpenVINO™ Physical AI, a runtime package for deploying robot policies in real-world environments. This release packages the core deployment stack including camera capture, robot interfaces, exported-policy inference, and runtime loop integration.
Introduced unified camera API supporting UVC, RealSense, Basler, and shared-camera transport workflows.
Implemented robot interfaces for SO-101 and Trossen WidowX AI integrations.
Enabled inference runtime for exported policies with built-in OpenVINO and ONNX backends.
Provided runtime control loop with PolicyRuntime, SyncExecution, and AsyncExecution capabilities.
Included hardware-specific extras for camera and robot integrations.

Other Changes and Known Issues#

Jupyter Notebooks#

New models and use cases:

Known Issues#

Component: CPU/GPU Plugin
ID: 186412
Description:
Gemma 4 INT8/INT4 weight compressed models require explicitly setting KV cache group size to
64 to preserve accuracy. This modification has been included in the latest version of Optimum
Intel for model export.

Component: OpenVINO Runtime
ID: 186160
Description:
Gemma 3, internvl2-4b, minicpm4-0.5b and minicpm4-8b models’ accuracy is reduced when converted
with optimum-intel and Transformers library version 5.0.0+. As a workaround, convert the models
using following Transformers library versions: Gemma 3- Transformers 4.57.6, internvl2-4b-
Transformers 4.51.3, minicpm4-0.5b and minicpm4-8b- Transformers 4.53.3.

Component: NPU Plugin
ID: 214108
Description:
Phi 3.5/4 models accuracy is reduced when converted with optimum-intel and Transformers library
version 5.0.0+. As a workaround, convert the model using Transformers library version 4.57.6.

Component: GPU Plugin
ID: 187445
Description:
Trinity-mini model can demonstrate low accuracy on GPU platforms with INT8 precision.

Component: OpenVINO Runtime
ID: 187019
Description:
Gemma 4 models can demonstrate low accuracy on some platforms.

Component: GPU Plugin
ID: 187077
Description:
YOLO26 fails to compile on GPU.

Component: GPU Plugin
ID: 180852
Description:
GPT-OSS-120B can demonstrate low accuracy on GPU platforms and requires
ACTIVATIONS_SCALE_FACTOR 8.

Previous 2026 releases#

Deprecation And Support#

Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. For more details, refer to: OpenVINO Legacy Features and Components.

Discontinued in 2026#

The deprecated openvino.runtime namespace has been removed. Please use the openvino namespace directly.
The deprecated openvino.Type.undefined has been removed. Please use openvino.Type.dynamic instead.
The PostponedConstant constructor signature has been updated for improved usability:
- Old (removed): Callable[[Tensor], None]
- New: Callable[[], Tensor]
The deprecated OpenVINO™ GenAI predefined generation configs were removed.
The deprecated OpenVINO GenAI support for whisper stateless decoder model has been removed. Please use a stateful model.
The deprecated OpenVINO GenAI StreamerBase put method, bool return type for callbacks, and ChunkStreamer class has been removed.
NNCF create_compressed_model() method is now deprecated and removed in 2026. Please use nncf.prune() method for unstructured pruning and nncf.quantize() for INT8 quantization.
NNCF optimization methods for TensorFlow models and TensorFlow backend in NNCF are deprecated and removed in 2026. It is recommended to use PyTorch analogous models for training-aware optimization methods and OpenVINO™ IR, PyTorch, and ONNX models for post-training optimization methods from NNCF.
The following experimental NNCF methods are deprecated and removed: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, Movement Sparsity.
CPU plugin now requires support for the AVX2 instruction set as a minimum system requirement. The SSE instruction set will no longer be supported.
OpenVINO™ migrated builds based on RHEL 8 to RHEL 9.
manylinux2014 upgraded to manylinux_2_28. This aligns with modern toolchain requirements but also means that CentOS 7 will no longer be supported due to glibc incompatibility.

Deprecated and to be removed in the future#

Support for Ubuntu 20.04 has been discontinued due to the end of its standard support.
The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. Find more information in the Release policy.
auto shape and auto batch size (reshaping a model in runtime) will be removed in the future. OpenVINO™’s dynamic shape models are recommended instead.
MacOS x86 is no longer recommended for use due to the discontinuation of support.
APT & YUM Repositories Restructure: Starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like “2025”). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages:
- Installation guide - yum
- Installation guide - apt
OpenCV binaries will be removed from Docker images in 2026.
With the release of Node.js v22, updated Node.js bindings are now available and compatible with the latest LTS version. These bindings do not support CentOS 7, as they rely on newer system libraries unavailable on legacy systems.
Starting with 2026.0 release major internal refactoring of the graph iteration mechanism has been implemented for improved performance and maintainability. The legacy path can be enabled by setting the ONNX_ITERATOR=0 environment variable. This legacy path is deprecated and will be removed in future releases.
OpenVINO Model Server:
- The dedicated OpenVINO operator for Kubernetes and OpenShift is now deprecated in favor of the recommended KServe operator. The OpenVINO operator will remain functional in upcoming OpenVINO Model Server releases but will no longer be actively developed. Since KServe provides broader capabilities, no loss of functionality is expected. On the contrary, more functionalities will be accessible and migration between other serving solutions and OpenVINO Model Server will be much easier.
- TensorFlow Serving (TFS) API support is planned for deprecation. With increasing adoption of the KServe API for classic models and the OpenAI API for generative workloads, usage of the TFS API has significantly declined. Dropping date is to be determined based on the feedback, with a tentative target of mid-2026.
- Support for Stateful models will be deprecated. These capabilities were originally introduced for Kaldi audio models which is no longer relevant. Current audio models support relies on the OpenAI API, and pipelines implemented via OpenVINO GenAI library.
- Directed Acyclic Graph Scheduler will be deprecated in favor of pipelines managed by MediaPipe scheduler and will be removed in 2026.3. That approach gives more flexibility, includes wider range of calculators and has support for using processing accelerators.
OpenVINO™ GenAI:
- start_chat() / finish_chat() APIs are deprecated and will be removed in a future major release. Pass a ChatHistory object directly to generate() instead.

Legal Information#

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.

Performance varies by use, configuration and other factors.