NPU with OpenVINO GenAI#
This guide will give you extra details on how to use NPU with OpenVINO GenAI. See the installation guide for information on how to start.
Prerequisites#
Install required dependencies:
Linux:
python3 -m venv npu-env
source npu-env/bin/activate
pip install nncf==2.14.1 onnx==1.17.0 optimum-intel==1.21.0
pip install openvino==2025.0 openvino-tokenizers==2025.0 openvino-genai==2025.0
For the pre-production version, use the following command instead:
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Windows:
python -m venv npu-env
npu-env\Scripts\activate
pip install nncf==2.14.1 onnx==1.17.0 optimum-intel==1.21.0
pip install openvino==2025.0 openvino-tokenizers==2025.0 openvino-genai==2025.0
For the pre-production version, use the following command instead:
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
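After installation, you can check that the NPU device is visible to OpenVINO. The snippet below is a minimal sketch; whether "NPU" appears in the list depends on your hardware and driver:
import openvino as ov

# List the devices OpenVINO can use; "NPU" should appear if the driver is set up correctly.
core = ov.Core()
print(core.available_devices)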
Note that for systems based on Intel® Core™ Ultra Processors Series 2, more than 16GB of RAM may be required to run prompts over 1024 tokens on models exceeding 7B parameters, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.
Make sure your model works with NPU. Some models may not be supported, for example, the FLUX.1 pipeline is currently not supported by the device.
Currently, the Whisper pipeline (using whisper-tiny, whisper-base, whisper-small, or whisper-large) only accepts models generated with the --disable-stateful flag.
Here is a conversion example:
optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny whisper-tiny --disable-stateful
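Once exported, the model can be run with the Whisper pipeline on NPU. Below is a minimal sketch; the audio file name and the use of librosa to load 16 kHz mono audio are assumptions, and any source of 16 kHz float samples will do:
import librosa  # assumed helper for reading audio; not part of the OpenVINO packages
import openvino_genai as ov_genai

# WhisperPipeline expects raw 16 kHz speech samples rather than a file path.
raw_speech, _ = librosa.load("sample.wav", sr=16000)

pipe = ov_genai.WhisperPipeline("whisper-tiny", "NPU")
print(pipe.generate(raw_speech.tolist()))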
Export an LLM model via Hugging Face Optimum-Intel#
Since symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU, make sure to export the model with the proper conversion and optimization settings.
You can use one of two compression methods: group quantization or channel-wise quantization. Select one by setting the --group-size parameter to 128 or -1, respectively. See the following examples.
For channel-wise quantization, set --group-size to -1:
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
If you want to improve accuracy, make sure you:
Use NNCF 2.13 or newer (the version installed in Prerequisites already satisfies this):
pip install "nncf>=2.13"
Use --scale-estimation --dataset <dataset_name> together with AWQ (--awq):
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Llama-2-7b-chat-hf
Important
Remember that the negative value of -1 is required here, not 1.
For group quantization, set --group-size to 128:
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
You can also try using 4-bit (INT4) GPTQ models, which do not require specifying quantization parameters:
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ
The following models have been tested on NPU:
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Llama-3.1-8B
microsoft/Phi-3-mini-4k-instruct
Qwen/Qwen2-7B
mistralai/Mistral-7B-Instruct-v0.2
openbmb/MiniCPM-1B-sft-bf16
TinyLlama/TinyLlama-1.1B-Chat-v1.0
TheBloke/Llama-2-7B-Chat-GPTQ
Qwen/Qwen2-7B-Instruct-GPTQ-Int4
Run generation using OpenVINO GenAI#
It is recommended to install the latest available NPU driver.
Use the following code snippet to perform generation with the OpenVINO GenAI API:
import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = "TinyLlama";
    ov::genai::LLMPipeline pipe(model_path, "NPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    std::cout << pipe.generate("The Sun is yellow because", config);
}
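To print tokens as they are produced instead of waiting for the whole response, you can pass a streamer callback to generate(). A minimal Python sketch, with an illustrative prompt and token limit:
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")

# The callback receives decoded text chunks as they become available.
def streamer(subword):
    print(subword, end="", flush=True)
    return False  # False means "continue generation"

pipe.generate("The Sun is yellow because", max_new_tokens=100, streamer=streamer)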
Additional configuration options#
Prompt and response length options#
The LLM pipeline for NPUs leverages the static shape approach, optimizing execution performance, while potentially introducing certain usage limitations. By default, the LLM pipeline supports input prompts up to 1024 tokens in length. It also ensures that the generated response contains at least 150 tokens, unless the generation encounters the end-of-sequence (EOS) token or the user explicitly sets a lower length limit for the response.
You may configure both the ‘maximum input prompt length’ and ‘minimum response length’ using the following parameters:
MAX_PROMPT_LEN - defines the maximum number of tokens that the LLM pipeline can process for the input prompt (default: 1024),
MIN_RESPONSE_LEN - defines the minimum number of tokens that the LLM pipeline will generate in its response (default: 150).
Use the following code snippet to change the default settings:
pipeline_config = { "MAX_PROMPT_LEN": 1024, "MIN_RESPONSE_LEN": 512 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN", 1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
Cache compiled models#
By caching compiled models, you may shorten the initialization time of future pipeline runs. To do so, specify one of the following options in pipeline_config for the NPU pipeline.
NPUW_CACHE_DIR#
NPUW_CACHE_DIR is the most basic option: it caches compiled subgraphs without weights and reuses them for future pipeline runs.
CACHE_DIR#
CACHE_DIR operates similarly to the older NPUW_CACHE_DIR, except for two differences:
It creates a single ".blob" file and loads it faster.
It stores all model weights inside the blob, making it much bigger than the individual compiled schedules for the model's subgraphs stored by NPUW_CACHE_DIR.
pipeline_config = { "CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "CACHE_DIR", ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
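The effect of caching is easy to observe by timing pipeline construction: the first run compiles the model and fills the cache, while later runs load the cached blob. A minimal sketch (timings depend on the model and hardware):
import time
import openvino_genai as ov_genai

pipeline_config = { "CACHE_DIR": ".npucache" }

start = time.perf_counter()
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)
print(f"Pipeline ready in {time.perf_counter() - start:.1f} s")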
‘Ahead of time’ compilation#
Specifying the EXPORT_BLOB and BLOB_PATH parameters works similarly to CACHE_DIR, but:
It allows you to explicitly specify where to store the compiled model.
For subsequent runs, it requires the same BLOB_PATH to import the compiled model.
To export the compiled model:
pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "EXPORT_BLOB", "YES" }, { "BLOB_PATH", ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
pipeline_config = { "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "BLOB_PATH", ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
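A common pattern is to export the blob on the first run and import it afterwards. The sketch below switches between the two configurations depending on whether the blob file already exists; the paths are illustrative:
import os
import openvino_genai as ov_genai

blob_path = os.path.join(".npucache", "compiled_model.blob")

if os.path.exists(blob_path):
    # Import the previously compiled model.
    pipeline_config = { "BLOB_PATH": blob_path }
else:
    # Compile the model and store the blob for future runs.
    os.makedirs(".npucache", exist_ok=True)
    pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": blob_path }

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)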
Disable memory allocation#
In case of execution failures, either silent or with errors, try to update the NPU driver to 32.0.100.3104 or newer. If the update is not possible, set the DISABLE_OPENVINO_GENAI_NPU_L0 environment variable to disable NPU memory allocation, which might be supported only on newer drivers for Intel Core Ultra 200V processors.
Set the environment variable in a terminal:
Linux:
export DISABLE_OPENVINO_GENAI_NPU_L0=1
Windows:
set DISABLE_OPENVINO_GENAI_NPU_L0=1
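You can also set the variable from Python before the pipeline is created, assuming it is read at pipeline construction time:
import os
import openvino_genai as ov_genai

# Must be set before the NPU pipeline is constructed.
os.environ["DISABLE_OPENVINO_GENAI_NPU_L0"] = "1"

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")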
Performance modes#
You can configure the NPU pipeline with the GENERATE_HINT option to switch between two different performance modes:
FAST_COMPILE (default) - enables fast compilation at the expense of performance,
BEST_PERF - ensures best possible performance at lower compilation speed.
Use the following code snippet:
pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "GENERATE_HINT", "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
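GENERATE_HINT can be combined with the other options described above in a single pipeline_config. For example, the following sketch requests best performance together with a longer prompt limit and model caching; the specific values are illustrative:
import openvino_genai as ov_genai

pipeline_config = {
    "GENERATE_HINT": "BEST_PERF",
    "MAX_PROMPT_LEN": 2048,
    "CACHE_DIR": ".npucache",
}
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)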