OpenVINO Release Notes#
2026.2 - 28 May 2026#
System Requirements | Release policy | Installation Guides
What’s New#
More Gen AI coverage and frameworks integrations to minimize code changes
New models supported: Gemma 4 E2B and Gemma 4 E4B
Only on CPUs & GPUs: Qwen3-Coder-Next, Qwen3.5, Qwen3.6, Trinity-mini, LFM2-24B-A2B, LFM2-8B-A1B, LFM2.5-350M
Only on CPUs: YOLO26
Only on GPUs: Gemma 4 31B and Gemma 4 26B-A4B
Extended to GPUs: GPT-OSS-120B
Scaled Dot-Product Attention (SDPA) path support added for LFM2 models
Support for Hugging Face Transformers v5.0, ensuring compatibility with the latest model architecture for enhanced interoperability.
Broader LLM model support and more model compression techniques
OpenVINO™ GenAI introduces extension support for loading custom extension libraries and registering unsupported operations via the extensions property. This gives developers the flexibility to run models with custom ops that OpenVINO doesn’t support out of the box.
INT4 KV-cache compression is enabled for GPUs, with substantial memory reduction when KV cache size is significant, such as with large input prompts exceeding 32K tokens.
OpenVINO GenAI significantly reduces model loading times on GPU when using cache blobs — preventing bottlenecks for multi-stage AI pipelines, including agentic use cases that rely on multiple models.
Optimized IR read mode with independently managed constant buffers to reduce peak memory usage by avoiding unnecessary duplication of weight data unless required for correctness (Linux support added in this release).
Preview: Enhanced XAttention accuracy on CPUs and GPUs through by-channel INT8 KV-cache quantization (compared to by-token INT8 KV-cache), matching the default by-channel INT8 KV cache quantization when XAttention is not enabled.
More portability and performance to run AI at the edge, in the cloud or locally
OpenVINO™ GenAI extends its JavaScript API to include a Text-to-Speech pipeline and VLM samples for browser and Node.js developers.
OpenVINO™ Model Server extends tool-calling support to Qwen 3.5 and 3.6 models to enable agentic AI use cases.
OpenVINO™ Model Server adds streaming transcription support for speech-to-text, reducing latency for real-time voice applications.
Preview: Introducing OpenVINO Physical AI, a hardware-accelerated, production‑ready inferencing and deployment framework that standardizes how developers connect cameras, robots, models, and safety controls, reducing brittle custom harnesses and making complex systems easier to build, debug, and evolve on Intel platforms.
OpenVINO™ Runtime#
Common Plugin#
Filesystem path handling in the C++ API has been improved through internal frontend enhancements, eliminating platform-specific path conversion issues and reducing the risk of path-related errors during model loading and deployment workflows.
Introduced properties
RUNTIME_REQUIREMENTSandCOMPATIBILITY_CHECKwhich allow users to check if compiled models can be imported by the device before sending them to the OpenVINO runtime.Reduced peak and average memory consumption on model compilation when using mmap on Linux.
Improved model serialization error handling and report errors when there is no space on disk to store model.
The
ov::save_modelfunction now adds runtime attributes containing the OpenVINO version used for saving, enabling better model provenance tracking and version compatibility management.Constant folding failures have been resolved when unsupported floating-point precision was encountered; operations now automatically fall back to FP32 precision to ensure successful model optimization.
CPU Device Plugin#
Enabled support for the new-generation Qwen3 series models, including Qwen3-Coder-Next, Qwen3.5 and Qwen3.6, with performance optimizations.
Added support and optimization for CausalConv1D and GatedDeltaNet operations with kernel implementation, enabling models such as Qwen3 and LFM2.
GPU Device Plugin#
Enabled support for the new-generation Qwen3 series models, including Qwen3-Coder-Next, Qwen3.5 and Qwen3.6, with performance optimizations.
Enabled parallel loading for model cache blobs, significantly reducing model load time.
Enabled by-channel INT8 KV cache quantization by default for GPU XAttention (sparse attention), delivering improved accuracy and aligning the configuration with the standard path.
INT4 KV cache quantization has been enabled, reducing memory consumption for key-value cache storage during LLM inference.
Improved Qwen3-vl-4b performance for TTFT, TPOT, and model loading.
The GPT-OSS-20B model now supports INT8 weight precision and runs on Intel® Core™ Ultra Series 2 processors (H-SKUs) and Intel® Arc™ GPUs.
Improved ResNet-34 performance on Intel® Xe2 architecture.
NPU Device Plugin#
Integrated NPU compiler version 8.1, splitting the compiler library into
openvino_intel_npu_compiler_loaderandopenvino_intel_npu_compilercomponents. The loader library can return the list of supported properties and respond to queries about specific properties. The core compiler library is loaded in memory only when a model is compiled.Added blob encryption and decryption support through
ov::cache_encryption_callbacks, enabling secure model caching and export/import workflows. A security warning will be issued when using encryption callbacks with ‘Compiler-in-Driver’ on driver versions up to 32.0.100.4724 due to temporary unencrypted file usage.Added support for
ov::runtime_requirementsto expose a human-readable compatibility description of the compiled model. Current support covers ‘Compiler-in-Plugin’ models, with ‘Compiler-in-Driver’ support planned for a future release.Added support for
ov::compatibility_checkto check the compatibility of a compiled model based only on description (previously obtained throughov::runtime_requirements). Models are marked as supported in the current OpenVINO release if compiled for the current platform with sufficient device tiles available for execution. The compatibility check does not currently consider driver capabilities and cannot guarantee model acceptance during import. This logic will be adjusted in the upcoming release to rely on Level Zero (UMD) for more extensive compatibility checks when available in future drivers.Added support for
CompiledModel::release_memory(), enabling memory consumption reduction between inferences through NPU driver graph eviction mechanisms.Implemented Level Zero command queue pooling to enable queue sharing across multiple compiled models, reducing doorbell utilization and recycling overhead.
Added model priority to be changed dynamically for compiled models through the
ov::model_priorityproperty after model compilation.Exposed compiler version information through compiled model properties, providing better traceability for which compiler was automatically selected by the plugin during compilation and preserving this information in cached or exported models.
Added device recovery mechanisms to handle device lost errors, allowing applications to create new
ov::Coreinstances after device resets to resume execution, aligning with future driver recovery capabilities.Introduced general attention handling optimization, reducing time-to-first-token (TTFT) on most public LLMs for longer prompts (starting at 4K tokens). Use
NPUW_PREFILL_ATTENTION_HINT:STATICto revert to previous behavior if issues occur.Improved compile time for large models, including LLMs.
Enhanced partitioning stability that previously caused compilation failures for several VLM models.
Enhanced support for GenAI models produced with Transformers 5.x.
Improved second token performance stability for LLMs on Intel® Core™ Ultra Series 2 processors.
Introduced new backward-compatible blob format, currently limited to classic (non-LLM) models. Blobs exported with
NPUW_ENSURE_COMPATIBILITY:YESmaintain compatibility with future OpenVINO versions for the given NPU architecture.
OpenVINO Node.js API#
Error handling in the Node.js API has been enhanced for
CoreandAsyncInferQueueclasses, providing more robust and predictable exception management during model loading and inference operations.
PyTorch Framework Support#
GPTQ quantized model support has been added via torch.export, enabling conversion of 4-bit GPTQ models (AutoGPTQ) through the torch.export path.
Quantized model conversion with torch.export has been extended to support AWQ and BitNet quantized models in addition to existing TorchScript compatibility.
CELU operation accuracy has been fixed to match PyTorch behavior, resolving numerical precision issues.
ONNX Framework Support#
Scan, Loop, and If operations have been improved to correctly handle models using graph initializers as direct outputs of control flow subgraphs, resolving conversion failures and incorrect results. Scan operation now validates
num_scan_inputsand properly handles models where the loop body has fewer outputs than initial state values.Tokenizer operation support has been added via openvino-tokenizers integration, enabling conversion of ONNX models using StringNormalizer, LabelEncoder, Tokenizer, and TfIdfVectorizer operations when openvino-tokenizers is installed. Users receive guidance to install the package if missing.
OpenVINO™ Model Server#
Performance
Improved performance on Intel Data Center GPU Flex 60 and Flex 70 for Qwen3-30B MoE model family.
Improved multinomial algorithm performance, reducing latency for generation with temperature > 0.
Improved model loading and pipeline initialization performance for new inference requests.
New models and hardware support
Restored support for generative models on hosts with CPUs without Intel® AVX2 instruction set when using supported discrete GPUs.
Added support for Intel® Xe GPUs for MoE models, including Intel® Arc™ A770.
Enabled execution of GPT-OSS-20B with INT8 precision and GPT-OSS-120B with INT4 precision on GPU.
Enabled models and support for MoE for Qwen3.5, Qwen3.6, Qwen3-Coder-Next, and Gemma 4 (without continuous batching).
Fixed chat template rendering for DeepSeek and Granite models when processing non-ASCII characters.
Added tool parsers for Gemma 4 and LFM2 models.
Deployment ease
Improved default performance tuning to use resource constraints in Docker containers, with default number of REST workers, OpenVINO inference streams, threads, and CPU pinning configurations avoiding quota and ulimit settings on Linux to prevent overallocation and performance degradation in Docker and Kubernetes environments.
Enhanced deployment capabilities with local generative model startup options and runtime parameter configuration through CLI, enabling generative model deployment from read-only filesystems with configurable runtime parameters such as target device and cache size for seamless KServe and OpenShift integration.
Improved model pulling recovery mechanisms to resume interrupted Hugging Face model downloads from the previous checkpoint in case of failures or interruptions.
New or improved endpoints capabilities
Added initial support for
/responsesendpoint.Fixed server readiness endpoint behavior -
/v2/health/readynow correctly reports success when all models are fully initialized and returns appropriate errors when models are not loaded.Added
min_psampling parameter for enhanced generation control.Added
skip_special_tokenssampling parameter - when set to False, returns raw model responses including special tokens to users.Fixed default seed parameter to use random values, ensuring non-deterministic responses from LLM models.
Added LoRA adapter support for both image generation models and LLM models.
Added support of streaming for audio/transcriptions endpoint
Introduced
OVMS_AUDIO_MAX_FILE_SIZE_BYTESenvironment variable that controls the upper bound on memory that a single audio request can allocate for decoded data.
Limitations:
Gemma 4 and LFM2 MoE models supported without Continuous Batching
/responsesendpoint doesn’t include built-in tools, audio input and multinomial output. There are also no session management capabilities.
Neural Network Compression Framework#
Added support for
transpose_aattribute in Scale Estimation compression algorithm.Added INT4/INT8 weight compression support for Vision-Language-Action (VLA) with Pi0.5 model.
OpenVINO Tokenizers#
Added ONNX Frontend extension with new translators for tokenization-related operations: Label Encoder, StringNormalizer, Tokenizer, TFID Vectorizer.
Updated TensorFlow Frontend extension with AsString operation translator.
Extended Python CLI with new
checkanddiagnosetools.Reduced binary size on Linux and Mac platforms.
OpenVINO™ GenAI#
Support for hybrid attention models with linear state (CausalConv1D, GatedDeltaNet) for SDPA and PA backend.
VLM models enabled: Qwen3.5, Qwen3.6, Gemma 4 (SDPA backend), VideoChat-Flash.
Hybrid attention text models enabled: LFM2, Qwen3-Coder-Next.
VLMPipeline now supports in-pipeline video sampling with video metadata API, enabling raw video frames input directly without manual frame sampling.
Continuous Batching API now supports
images_batches,videos_batches,videos_metadata_batchesproperties.Multinomial sampling performance has been improved, and a new
min_psampling parameter has been added for enhanced generation control.Whisper pipeline results now include a language field indicating the detected or specified language for audio transcription and translation tasks.
LoRA adapters can now be applied to Text2VideoPipeline.
TaylorSeer caching mechanism is now enabled by default for Flux, Stable Diffusion 3, and LTX-Video.
Added new
TOOL_CALL:finish_reasonfor better integration with agentic tools.Performance metrics now include
apply_chat_template()latency measurements for comprehensive chat pipeline profiling.A new extensions API has been added, enabling direct loading of OpenVINO extensions within GenAI pipelines for custom operation support.
The Node.js API has been updated to include support for Text2SpeechPipeline and Text2ImagePipeline, enabling speech and image generation.
Whisper pipeline on NPU now supports word-level timestamps by default.
OpenVINO™ Physical AI#
Released OpenVINO™ Physical AI, a runtime package for deploying robot policies in real-world environments. This release packages the core deployment stack including camera capture, robot interfaces, exported-policy inference, and runtime loop integration.
Introduced unified camera API supporting UVC, RealSense, Basler, and shared-camera transport workflows.
Implemented robot interfaces for SO-101 and Trossen WidowX AI integrations.
Enabled inference runtime for exported policies with built-in OpenVINO and ONNX backends.
Provided runtime control loop with
PolicyRuntime,SyncExecution, andAsyncExecutioncapabilities.Included hardware-specific extras for camera and robot integrations.
Other Changes and Known Issues#
Jupyter Notebooks#
New models and use cases:
Known Issues#
Previous 2026 releases#
2026.1 - 7 April 2026
What’s New
More Gen AI coverage and frameworks integrations to minimize code changes
New models supported on CPUs & GPUs: Qwen3 VL
New models supported on CPUs: GPT-OSS-120B
Preview: Introducing the OpenVINO backend for llama.cpp, which enables optimized inference on Intel CPUs, GPUs, and NPUs. Validated on GGUF models such as Llama-3.2-1B-Instruct-GGUF, Phi-3-mini-4k-instruct-gguf, Qwen2.5-1.5B-Instruct-GGUF, and Mistral-7B-Instruct-v0.3.
New notebook: Unified VLM chatbot with video file support and interactive model switching across Qwen3-VL, Qwen2.5-VL, and LLaVa-NeXT-Video.
Broader LLM model support and more model compression techniques
OpenVINO™ GenAI adds TaylorSeer Lite caching for image and video generation, accelerating diffusion-transformer inference across Flux, SD3, and LTX-Video pipelines, aligned with Hugging Face Diffusers.
LTX-Video generation on GPU achieves end-to-end acceleration through fusion of RMSNorm and RoPE operators, significantly improving video generation performance.
OpenVINO™ GenAI adds dynamic LoRA support for Qwen3-VL and VL models with LLM, allowing developers to swap adapters at runtime for efficient serving of multiple model variants in production without reloading the base model.
Preview: The release-weights API for ov::Model enables memory reclamation during model compilation on NPUs, delivering dramatically lower peak memory consumption for edge and client deployments. Users must set this property in ov::Model, and it will be applied during compilation.
More portability and performance to run AI at the edge, in the cloud or locally
Introducing support for Intel® Core™ Series 3 processors (formerly codenamed Wildcat Lake) and Intel® Arc™ Pro B70 Graphics with 32GB memory for single-GPU inference on 20-30B parameter LLMs.
Prompt Lookup Decoding extended to vision-language pipelines, delivering significantly faster token generation for multimodal workloads on Intel CPUs and GPUs.
OpenVINO™ GenAI now has a smaller runtime footprint after eliminating ICU DLL dependencies from tokenization, leading to reduced memory usage, faster startup, and easier deployment.
OpenVINO GenAI introduces WhisperPipeline for Node.js via its NPM package, delivering production-ready speech recognition with word-level audio-to-text transcription.
OpenVINO™ Model Server enhances support for Qwen3-MOE and GPT-OSS-20B models, delivering improved performance, accuracy, and robust concurrent request handling with continuous batching. These pre-optimized models are available on Hugging Face for easy deployment. Additionally, the Model Server introduces image inpainting and outpainting capabilities via the /image endpoint for AI image editing.
OpenVINO™ Runtime
Common
Introduced new properties:
CACHE_PATHis fully compatible toCACHE_DIRbut natively supportsstd::filesystem::path.CACHE_BLOB_ID(preview) - Allows users to specify a custom ID for the compiled model in cache. This can accelerate model import times, but users must ensure ID uniqueness to prevent collisions.
Improve error messages in
IStreamsExecutor::Config::set_propertyImprove
ov::util::ConstantWriterclass functionality to reduce chance of introducing bug in hash calculationsFix static resource cleanup by allowing custom cleanup functions to be registered for OpenVINO™ components during library unloading.
Fix duplicate hash generation for different
ov::Model, eliminating unnecessary model recompilation when OpenVINO caching is enabled.Fix memory leaks and application crashes related to using CRT library (Windows).
CPU Device Plugin
Model inference performance has been optimized on Intel® Core™ Ultra Series 3 with 2 P-cores + 4 LPE-cores.
XAttention now maintains proper accuracy when enabled, resolving previous accuracy issues.
The accuracy issue with long prompt input has been fixed.
Upgraded oneDNN version to v3.10.
Improve Gemma3 image comprehension by implementing a custom attention mask pattern.
GPU Device Plugin
Performance has been improved for LTX-Video model.
Preview: Experimental L0 backend support for Xe2+ GPUs. Documentation is provided here
Memory optimization for SD3.5 Flash.
XAttention (Block Sparse Attention with Antidiagonal Scoring) is now initially supported on Intel’s Xe1 architecture to improve time-to-first token. (Xe2/3 is already supported). Performance has been improved for low threshold scenarios.
NPU Device Plugin
Batching changes in NPU plugin:
Eliminated the unconditional model clone in the Plugin batching path to reduce memory usage. The model is no longer cloned until after the initial Plugin batch-related checks have been performed.
Input and output layouts must be specified for Plugin batching to be applied. If layouts are not provided, the model will be compiled as-is, without any preliminary batch processing in the NPU Plugin.
Support for IO strides has been added. During model compilation, users can specify which input/output ports should accept tensors with strides using the new property:
ov::intel_npu::enable_strides_for. All desired IO ports must be selected at compilation time. Support for all IO ports is not enabled by default since it can reduce model performance. This feature is supported only with the NPU driver, starting from 32.0.100.4621 (Windows driver) or 1.30 (Linux driver). NPU Plugin will report the property as supported only when dependencies are met. Applications should first check if this property is supported.Fixed accuracy issues with INT8-ASYM Vocabulary on Gemma-2, also improved performance for this path.
Introduced Flash Attention support on for Intel® Core™ Ultra Series 3, allows faster LLM compilation for NPU with longer contexts.
Improved Long RoPE support for Phi models.
OpenVINO Python API
Fixed
ov::Tensorcreation from NumPy scalar data.Added support for
pathlib.Pathobjects inCore.add_extensionandFrontEndManager.register_front_endto unify handling of file path arguments.
OpenVINO Node.js API
The
openvino-genai-nodeNPM package has been updated to include the following improvements:Implemented WhisperPipeline: audio-to-text pipeline with word-level transcription, allowing users to generate precise and detailed speech recognition results.
New method getGenerationConfig is now available for LLMPipeline, VLMPipeline, and WhisperPipeline, allowing users to quickly retrieve default configuration values and streamline their setup process.
ChatHistory is now supported in VLMPipeline, enabling users to manage conversation context more effectively during generation.
Async error handling in LLMPipeline has been refactored, preserving existing behavior while improving internal error processing during asynchronous calls, resulting in a more stable and reliable user experience.
PyTorch Framework Support
The torch.export path has been significantly improved, aligning TorchFX operation coverage with the TorchScript path.
Support for float16 and bfloat16 data types has been added for constant input value extraction.
ONNX Framework Support
The Attention operation (Opset 23) is now supported, including multi-head, group query, and multi-query attention modes with KV caching, causal masking, softcap, and boolean/float attention masks. This enables direct conversion of transformer-based ONNX models without manual decomposition.
Sequence data type support has been extended with SequenceConstruct, SequenceEmpty, SequenceInsert, and ConcatFromSequence operations, enabling loop-based sequence accumulation patterns commonly found in control-flow models.
Support for FP8 quantization types (f8e4m3, f8e5m2) and block-wise quantization has been added to the QuantizeLinear and DequantizeLinear operations.
TensorFlow Framework Support
An issue in NonMaxSuppressionV2 where the iou_threshold parameter was ignored has been fixed.
TensorFlow Lite Framework Support
The TransposeConv operation has been fixed to correctly apply bias inputs.
OpenVINO™ Model Server
Enhanced support for Qwen3-MOE models and GPT-OSS-20B delivers improved performance, accuracy, and robust concurrent request handling with continuous batching capabilities. These models are now available in pre-optimized OpenVINO™ format directly on the Hugging Face hub, making it very easy to deploy them.
Added support for Qwen3-VL models with function calling capabilities, enabling this vision language model in agentic scenarios.
Extended
/imageendpoint to support inpainting and outpainting capabilities. It is now possible to pass the input image along with a mask to edit parts of the image or to extend the input image.Other improvements and fixes:
Server logs now report current KV cache allocation alongside current usage metrics. With dynamic cache size (default setting), allocation automatically scales during runtime based on the request’s concurrency and processed context length.
Generation request cancellation is now supported for NPU devices, where requests from disconnected clients will be cancelled.
The finish reason now returns
tool_callswhen the model generates a function call, in line with OpenAI API standards.Corrected tokens usage reporting in the text generation last streaming event with NPU execution.
Added extra streaming event right after the first token is generated, in line with OpenAI API. This will correct TTFT metric benchmarking using tools relying on streaming events.
Enhanced error handling for Hugging Face Hub model pulling/downloads includes retry and resume capabilities to address network connectivity issues with large model files. Download operations can now recover from previous errors or be reported in logs when recovery is not possible.
Neural Network Compression Framework
Added experimental support for NVFP4 data type.
Introduced an additional RoPe ignored pattern without the transpose node to support 4-bit compression for models like Phi-3.5-MoE-instruct.
Migrated TorchFX backend support from torch.ao to torchao.
Upgraded PyTorch version to 2.10.0.
Upgraded ONNX version to 1.20.1 and ONNX Runtime to 1.24.3.
OpenVINO Tokenizers
Added precomputed Unicode normalization maps to reduce first inference time and memory consumption.
Remove ICU library from dependencies, reducing binary size, build time, and complexity.
OpenVINO™ GenAI
Extended prompt lookup decoding support to Vision-Language Models (VLMs) to improve tokens per second (TPS) performance.
New
AggregationMode.ADAPTIVE_RKVeviction strategy that keeps the highest attention-mass blocks and fills remaining slots with the most semantically diverse ones.VLMPipeline now supports Qwen3-VL.
LoRA adapters can now be applied to VLMPipeline (only applied to the language-model (LLM) part), enabling task-specific fine-tuning without reloading the base model.
Improved VLM image resizing accuracy.
TaylorSeer Lite caching is now available for Flux, Stable Diffusion 3, and LTX-Video (disabled by default).
LoRA adapters in GGUF format can now be loaded directly into LLMPipeline and VLMPipeline.
TextEmbeddingPipeline now supports dynamic input shapes via the NPUW plugin, enabling NPU inference for a wider range of embedding models.
Improved pipeline loading time through asynchronous tokenizer warmup.
Other Changes and Known Issues
Jupyter Notebooks
New models and use cases:
Archived Notebooks Tab: Added a dedicated “Archived” tab to the OpenVINO Notebooks portal. Users can now easily search and browse older or deprecated notebooks, keeping the main catalog focused on the latest updates while retaining access to historical content.
Known Issues
2026.0 - 23 February 2026
What’s New
More Gen AI coverage and frameworks integrations to minimize code changes
New models supported on CPUs & GPUs: GPT-OSS-20B, Qwen3-30B-A3B, MiniCPM-V-4_5-8B, and MiniCPM-o-2.6.
New models supported on NPUs: MiniCPM-o-2.6. In addition, NPU support is now available on Qwen2.5-1B-Instruct, Qwen3-Embedding-0.6B, Qwen-2.5-coder-0.5B.
Preview: OpenVINO™ GenAI adds support for video generation pipeline based on LTX-Video model on CPU and GPUs.
OpenVINO™ GenAI now adds word-level timestamp functionality to the Whisper Pipeline on CPUs, GPUs, and NPUs, enabling more accurate transcriptions and subtitling in line with OpenAI and FasterWhisper implementations.
Phi-3-mini FastDraft model is now available on Hugging Face to accelerate LLM inference on NPUs. FastDraft optimizes speculative decoding for LLMs.
Broader LLM model support and more model compression techniques
OpenVINO™ GenAI and OpenVINO™ Model Server introduces EAGLE-3 speculative decoding to accelerate LLM inference using smarter token prediction on Intel CPUs and GPUs. Validated on Qwen3-8B model
With the new int4 data-aware weight compression for 3D MatMuls, the Neural Network Compression Framework enables MoE LLMs to run with reduced memory bandwidth, and improved accuracy compared to data-free schemes-delivering faster, more efficient deployment on resource-constrained devices.
Preview: The Neural Network Compression Framework now supports per-layer and per-group Look-Up Tables (LUT) for FP8-4BLUT quantization. This enables fine-grained, codebook-based compression that reduces model size and bandwidth while improving inference speed and accuracy for LLMs and transformer workloads.
More portability and performance to run AI at the edge, in the cloud or locally
OpenVINO™ GenAI adds VLM pipeline support to enhance Agentic AI framework integration.
OpenVINO GenAI now supports speculative decoding for NPUs, delivering improved performance and efficient text generation through a small draft model that is periodically validated by the full-size model.
Preview: NPU compiler integration with the NPU plugin enables ahead-of-time and on-device compilation without relying on OEM driver updates. Developers can enable this feature for a single, ready-to-ship package that reduces integration friction and accelerates time-to-value.
OpenVINO™ Model Server adds enhanced support for audio endpoint plus agentic continuous batching and concurrent runs for improved LLM performance in agentic workflows on Intel CPUs and GPUs.
OpenVINO™ Runtime
Common
API methods that accept filesystem paths as input are now standardized to accept
std::filesystem::path. This makes path handling more consistent across OpenVINO™ and simplifies integration in modern C++ codebases that already rely onstd::filesystem. Existingstd::stringandstd::wstringoverloads are still available.
CPU Device Plugin
GPT-OSS-20B model is now supported, with improved performance for Mixture-of-Experts subgraphs as well as Paged Attention with sink input.
Rotary Position Embedding fusion and kernel optimization have been expanded to cover more LLMs, including GLM4, to enhance overall performance.
The accuracy issue with Boolean causal masks in ScaledDotProduct Attention when using BF16/FP16 precision has been resolved, addressing accuracy problems in LFM2.
XAttention (Block Sparse Attention with Antidiagonal Scoring) is now available as a preview feature to improve Time-To-First-Token (TTFT) performance when processing long context inputs.
OneTBB library in OpenVINO™ Windows release has been upgraded from 2021.2.1 to 2021.13.1
Linux docker support for offline cores on platforms with multiple NUMA nodes.
GPU Device Plugin
Improved TTFT for Qwen3-30B-A3B INT4 model, support INT8 model.
Preview support for XAttention on Intel’s Xe2/Xe3 architecture to improve TTFT performance.
2nd token latency has been improved for GPT-OSS-20B INT4 model on Intel® Core™ Ultra Series 2, Intel® Core™ Ultra Series 3, and Intel® Arc™ B-Series Graphics.
TTFT has been improved for vision language models including Phi-3.5-vision, Phi-4-multimodal, and LLaVa-NeXT-Video.
NPU Device Plugin
NPU compiler is now included in the OpenVINO™ distribution package as a separate library. This is a preview feature and can be enabled by setting
ov::intel_npu::compiler_typeproperty toPREFER_PLUGINto utilize compiler-in-plugin with fallback to compiler-in-driver in case of compatibility or support issues. By default, the NPU will continue using compiler-in-driver.A new model marshaling and serialization mechanism has been implemented to avoid weight copying during compilation, reducing peak memory consumption by up to 1x the original weights size. This mechanism is currently available only when compiler-in-plugin option is enabled.
Added support for importing CPU virtual addresses into level zero memory through Remote Tensor APIs.
Fixed various issues related to sliding window context handling in models like Gemma and Phi, improved compatibility with the recent transformers packages.
Introduced new methods to handle attention,
NPUW_LLM_PREFILL_ATTENTION_HINTcan be set toPYRAMIDto significantly improve TTFT. The default value isSTATIC(no change to the existing behavior).Reduced KV-cache memory consumption, reaching up to 2.5 GB saving for select models on longer contexts (8..12K).
OpenVINO Python API
OpenVINO™ now supports u2, u3, and u6 unsigned integer data types, enabling more efficient memory usage for quantized models. The u3 and u6 types include optimized packing that writes values into three INT8 containers using a concurrency-friendly pattern, ensuring safe concurrent read/write operations without data spanning across byte boundaries.
Introduced
release_gil_before_calling_cpp_dtorfeature in Python bindings, which optimizes Global Interpreter Lock (GIL) handling during C++ destructor calls. This improves both stability and performance in multi-threaded Python applications.Improved PyThreadState management in the Python API for increased stability and crash prevention in complex threading scenarios.
OpenVINO Python package now requires only NumPy as a runtime dependency. The other packaging dependencies have been removed, resulting in a lighter installation footprint and fewer potential dependency conflicts.
Added instructions for debugging the Python API on Linux, helping developers troubleshoot and diagnose issues more effectively.
OpenVINO Node.js API
The Node.js API has been improved with GenAI features:
New parsers have been added to the LLMPipeline to extract structured outputs, reasoning steps, and tool calls from model responses. The parsing layer is fully extensible, enabling developers to plug in their own parsers to tailor how model outputs are interpreted and consumed in downstream applications.
Added support for running Visual-Language Models, enabling richer multimodal applications that combine image, video, and text understanding in a single VLMPipeline.
Introduced a dedicated TextRerankPipeline for re-ranking documents, providing a straightforward way to improve retrieval quality and increase relevance in search and RAG scenarios.
Removed the legacy behaviour whereby
LLMPipeline.generate()could return a string. It now always returnsDecodedResults, which provides consistent access to comprehensive information about the generation result, including the output text, scores, performance metrics, and parsed values.
PyTorch Framework Support
The
axis=Noneparameter is now supported for mean reduction operations, allowing for more flexible tensor averaging.Enhanced support for complex data types has been implemented to improve compatibility with vision-language models, such as Qwen.
ONNX Framework Support
Major internal refactoring of the graph iteration mechanism has been implemented for improved performance and maintainability. The legacy path can be enabled by setting the
ONNX_ITERATOR=0environment variable. This legacy path is deprecated and will be removed in future releases.
OpenVINO™ Model Server
Improvements in performance and accuracy for GPT-OSS and Qwen3-MOE models.
Improvements in execution performance especially on Intel® Core™ Ultra Series 3 built-in GPUs
Improved chat template examples to fix handling agentic use cases
Improvements in tool parsers to be less restrictive for the generated content and improve response reliability
Better accuracy with INT4 precisions especially with long prompts
Improvements in text2speech endpoint
Added voice parameter to choose speaker based on provided embeddings vector
Corrected handling of compilation cache to speed up model loading
Improvements in speech2text endpoint:
Added handling for temperature sampling parameter
Support for timestamps in the output
New parameters have been added to VLM pipelines to control domain name restrictions for image URLs in requests, with optional URL redirection support. By default, all URLs are blocked.
NPU execution for text embeddings endpoint (experimental)
Exposed tokenizer endpoint for reranker and LLM pipelines
Added configurable preprocessing for classic models. Deployed models can include extra preprocessing layers added in at runtime. This can simplify client implementations and enable sending encoded images to models, which are accepted as an array of input. Possible options include:
Color format change
Layout change
Scale changes
Mean changes
Added support for tool parser compatible with devstral model - take advantage of unsloth/Devstral-Small-2507 model or similar for coding tasks.
Updated numerous demos
Audio endpoints
VLM endpoints usage
Agentic demo
Visual Studio Code integration for code assistant
Image classification
Optimized file handle usage to reduce the number of open files during high-load operations on Linux deployments.
Neural Network Compression Framework
Extended 4-bit compression data-aware methods (AWQ, Scale Estimation, GPTQ) to support 3D matmuls for more accurate compression of such models as GPT-OSS-20B and Qwen3-30B-A3B.
Preview support for per-layer and per-block codebooks has been introduced for 4-bit weight compression (ADAPTIVE_CODEBOOK data type), which helps to reduce the quantization error in the case of per-channel weight compression. See the example for more details.
Added NNCF Profiler for layer-by-layer profiling of OpenVINO™ model activations. This is useful for debugging quantization and compression issues, comparing model variants, and understanding activation distributions. See more details in Readme and Jupyter notebook.
Added new API method,
nncf.prune(), for unstructured pruning of PyTorch models previously supported with the deprecated and removednncf.create_compressed_model()method.NNCF optimization methods for TensorFlow models and TensorFlow backend in NNCF are deprecated and removed in 2026. It is recommended to use PyTorch analogous models for training-aware optimization methods and OpenVINO IR, PyTorch, and ONNX models for post-training optimization methods from NNCF.
The following experimental NNCF methods are deprecated and removed: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, Movement Sparsity.
OpenVINO Tokenizers
Added support for Qwen3 Reranker and LFM2 models.
The
UTF8Validateoperation has been made available for use in the GGUF GenAI converter.Improved tokenization accuracy through improved metaspace handling when processing special tokens.
OpenVINO™ GenAI
Added preview support for video generation via Text2Video pipeline with LTX-Video model.
Support for EAGLE3 speculative decoding pipeline to boost TPS with an additional EAGLE3 draft model. Support is also enabled on Intel NPU.
Conditional Diversity Visual Token Pruning to minimize TTFT of Qwen2/2.5 VL models, this feature is disabled by default and must be turned on.
Added word-level timestamp generation for detailed transcriptions with WhisperPipeline.
Added ChatHistory API support for VLMPipeline with images and video.
Added VLLMParser wrapper.
Added universal video tags
<ov_genai_video_i>for VLM models with video support (Qwen2-VL, Qwen2.5-VL, LLaVa-NeXT-Video)Introduced NPU support for text embedding pipelines (for Qwen3-Embeddings-0.6B and similar models).
Other Changes and Known Issues
Jupyter Notebooks
New models and use cases:
Text-to-image generation with Qwen-Image and OpenVINO (experimental)
Multi-speaker dialogue generation with FireRedTTS-2 and OpenVINO (experimental)
Document Parsing using DeepSeek-OCR and OpenVINO (experimental)
Text-to-image generation with Z-Image-Turbo and OpenVINO (experimental)
Text-Image to Video generation with Wan2.2 and OpenVINO (experimental)
End-to-End Speech Recognition with Fun-ASR-Nano and OpenVINO (experimental)
Text-to-Speech (TTS) system with Fun-CosyVoice 3.0 and OpenVINO (experimental)
Deleted notebooks (still available in 2025.4 branch)
Known Issues
Deprecation And Support#
Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. For more details, refer to: OpenVINO Legacy Features and Components.
Discontinued in 2026#
The deprecated
openvino.runtimenamespace has been removed. Please use theopenvinonamespace directly.The deprecated
openvino.Type.undefinedhas been removed. Please useopenvino.Type.dynamicinstead.The PostponedConstant constructor signature has been updated for improved usability:
Old (removed):
Callable[[Tensor], None]New:
Callable[[], Tensor]
The deprecated OpenVINO™ GenAI predefined generation configs were removed.
The deprecated OpenVINO GenAI support for whisper stateless decoder model has been removed. Please use a stateful model.
The deprecated OpenVINO GenAI StreamerBase
putmethod,boolreturn type for callbacks, andChunkStreamerclass has been removed.NNCF
create_compressed_model()method is now deprecated and removed in 2026. Please usenncf.prune()method for unstructured pruning andnncf.quantize()for INT8 quantization.NNCF optimization methods for TensorFlow models and TensorFlow backend in NNCF are deprecated and removed in 2026. It is recommended to use PyTorch analogous models for training-aware optimization methods and OpenVINO™ IR, PyTorch, and ONNX models for post-training optimization methods from NNCF.
The following experimental NNCF methods are deprecated and removed: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, Movement Sparsity.
CPU plugin now requires support for the AVX2 instruction set as a minimum system requirement. The SSE instruction set will no longer be supported.
OpenVINO™ migrated builds based on RHEL 8 to RHEL 9.
manylinux2014 upgraded to manylinux_2_28. This aligns with modern toolchain requirements but also means that CentOS 7 will no longer be supported due to glibc incompatibility.
Deprecated and to be removed in the future#
Support for Ubuntu 20.04 has been discontinued due to the end of its standard support.
The openvino-nightly PyPI module will soon be discontinued. End-users should proceed with the Simple PyPI nightly repo instead. Find more information in the Release policy.
auto shapeandauto batchsize (reshaping a model in runtime) will be removed in the future. OpenVINO™’s dynamic shape models are recommended instead.MacOS x86 is no longer recommended for use due to the discontinuation of support.
APT & YUM Repositories Restructure: Starting with release 2025.1, users can switch to the new repository structure for APT and YUM, which no longer uses year-based subdirectories (like “2025”). The old (legacy) structure will still be available until 2026, when the change will be finalized. Detailed instructions are available on the relevant documentation pages:
OpenCV binaries will be removed from Docker images in 2026.
With the release of Node.js v22, updated Node.js bindings are now available and compatible with the latest LTS version. These bindings do not support CentOS 7, as they rely on newer system libraries unavailable on legacy systems.
Starting with 2026.0 release major internal refactoring of the graph iteration mechanism has been implemented for improved performance and maintainability. The legacy path can be enabled by setting the ONNX_ITERATOR=0 environment variable. This legacy path is deprecated and will be removed in future releases.
OpenVINO Model Server:
The dedicated OpenVINO operator for Kubernetes and OpenShift is now deprecated in favor of the recommended KServe operator. The OpenVINO operator will remain functional in upcoming OpenVINO Model Server releases but will no longer be actively developed. Since KServe provides broader capabilities, no loss of functionality is expected. On the contrary, more functionalities will be accessible and migration between other serving solutions and OpenVINO Model Server will be much easier.
TensorFlow Serving (TFS) API support is planned for deprecation. With increasing adoption of the KServe API for classic models and the OpenAI API for generative workloads, usage of the TFS API has significantly declined. Dropping date is to be determined based on the feedback, with a tentative target of mid-2026.
Support for Stateful models will be deprecated. These capabilities were originally introduced for Kaldi audio models which is no longer relevant. Current audio models support relies on the OpenAI API, and pipelines implemented via OpenVINO GenAI library.
Directed Acyclic Graph Scheduler will be deprecated in favor of pipelines managed by MediaPipe scheduler and will be removed in 2026.3. That approach gives more flexibility, includes wider range of calculators and has support for using processing accelerators.
OpenVINO™ GenAI:
start_chat()/finish_chat()APIs are deprecated and will be removed in a future major release. Pass a ChatHistory object directly togenerate()instead.
Legal Information#
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at www.intel.com or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Atom, Core, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Copyright © 2026, Intel Corporation. All rights reserved.
For more complete information about compiler optimizations, see our Optimization Notice.
Performance varies by use, configuration and other factors.