OpenVINO GenAI on NPU#

This guide provides additional details on how to use NPU with OpenVINO GenAI. See the installation guide for information on how to get started.

Prerequisites#

First, install the required dependencies for the model conversion:

Linux:

python3 -m venv npu-env
source npu-env/bin/activate
pip install nncf==2.18.0 onnx==1.18.0 optimum-intel==1.25.2 transformers==4.51.3
pip install openvino==2025.3 openvino-tokenizers==2025.3 openvino-genai==2025.3

For the pre-production version, use the following line, instead:

pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

Windows:

python -m venv npu-env
npu-env\Scripts\activate
pip install nncf==2.18.0 onnx==1.18.0 optimum-intel==1.25.2 transformers==4.51.3
pip install openvino==2025.3 openvino-tokenizers==2025.3 openvino-genai==2025.3

For the pre-production version, use the following line, instead:

pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

Note

With OpenVINO 2025.3, it is highly recommended to use transformers==4.51.3 to generate models for Intel NPU. Newer Transformers versions will be supported in upcoming releases.

Note

For systems based on Intel® Core™ Ultra Processors Series 2, more than 16GB of RAM may be required to process prompts longer than 1024 tokens with models exceeding 7B parameters, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.

LLM Inference on NPU#

Export LLMs from Hugging Face#

Optimum Intel is the primary way to export Hugging Face models for inference on NPU. LLMs must be exported with the following settings:

  • Symmetric weights compression: --sym;

  • 4-bit weight format (INT4 or NF4): --weight-format int4 or --weight-format nf4;

  • Channel-wise or group-wise weight quantization: --group-size -1 or --group-size 128;

  • Maximize the 4-bit weight ratio in the model: --ratio 1.0.

Group quantization (GQ) with group size 128 is recommended for smaller models, e.g. up to 4B–5B parameters. Larger models may also work with group quantization, but normally demonstrate better performance with channel-wise quantization.

Channel-wise quantization (CW) generally offers the best performance but may reduce model accuracy. OpenVINO Neural Network Compression Framework (NNCF) provides several methods to compensate for the quality loss, such as data-aware compression methods or GPTQ.

The full optimum-cli command examples are shown below:

INT4 Symmetric channel-wise data-free compression:

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 Meta-Llama-3.1-8B-Instruct

INT4 Symmetric data-aware channel-wise compression:

Use Scale Estimation (--scale-estimation) and/or AWQ (--awq) to improve accuracy for the channel-wise quantized models. Note that these options require a dataset (--dataset <dataset_name>). Refer to the optimum-cli and NNCF documentation for more details.

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Meta-Llama-3.1-8B-Instruct

NF4 Symmetric data-free channel-wise compression:

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format nf4 --sym --group-size -1 --ratio 1.0 Meta-Llama-3.1-8B-Instruct

Usually, NF4-CW provides better accuracy than INT4-CW, even with data-free compression. Data-aware methods are also available and can further improve the compressed model accuracy.
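As an illustrative sketch, the AWQ and Scale Estimation options from the INT4 data-aware example above can be combined with the NF4 format in the same way (whether these options are available for NF4 may depend on your optimum-intel and NNCF versions):

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B-Instruct --weight-format nf4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Meta-Llama-3.1-8B-Instruct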

INT4 Symmetric data-free group quantization:

optimum-cli export openvino -m microsoft/Phi-3.5-mini-instruct --weight-format int4 --sym --ratio 1.0 --group-size 128 Phi-3.5-mini-instruct

Data-aware methods are also available for the group-quantized models and can further improve the compressed model accuracy.
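For example, a sketch that combines the group-quantized export above with the same data-aware options used in the channel-wise example (flags reused as-is, adjust for your model and dataset):

optimum-cli export openvino -m microsoft/Phi-3.5-mini-instruct --weight-format int4 --sym --ratio 1.0 --group-size 128 --awq --scale-estimation --dataset wikitext2 Phi-3.5-mini-instruct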

Note

The NF4 data type is only supported on Intel® Core Ultra Processors Series 2 NPUs (formerly codenamed Lunar Lake) and beyond. Use channel-wise quantization with NF4.

Important

For the channel-wise quantization, the group size argument must be -1 (“minus one”), not 1.

There are pre-compressed models on Hugging Face that can be exported as-is, for example, TheBloke/Llama-2-7B-Chat-GPTQ or OpenVINO/Mistral-7B-Instruct-v0.2-int4-cw-ov. In this case, the commands are as simple as:

optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ
optimum-cli export openvino -m OpenVINO/Mistral-7B-Instruct-v0.2-int4-cw-ov Mistral-7B-Instruct-v0.2-int4-cw-ov

Run text generation#

It is recommended to install the latest available NPU driver. Use the following code snippet to perform generation with OpenVINO GenAI API.

Python:

import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))

C++:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
   std::string model_path = "TinyLlama";
   ov::genai::LLMPipeline pipe(model_path, "NPU");
   ov::genai::GenerationConfig config;
   config.max_new_tokens = 100;
   std::cout << pipe.generate("The Sun is yellow because", config);
}

Additional configuration options#

Important

The options described in this article are specific to the NPU device and may not work with other devices.

Prompt and response length options#

The LLM pipeline for NPUs leverages the static shape approach, optimizing execution performance, while potentially introducing certain usage limitations. By default, the LLM pipeline supports input prompts up to 1024 tokens in length. It also ensures that the generated response contains at least 128 tokens, unless the generation encounters the end-of-sequence (EOS) token or the user explicitly sets a lower length limit for the response.

You may configure both the ‘maximum input prompt length’ and ‘minimum response length’ using the following parameters:

  • MAX_PROMPT_LEN – defines the maximum number of tokens that the LLM pipeline can process for the input prompt (default: 1024),

  • MIN_RESPONSE_LEN – defines the minimum number of tokens that the LLM pipeline can generate in its response (default: 128).

The maximum context size for an LLM on NPU is defined as the sum of these two values. By default, if the input prompt is shorter than MAX_PROMPT_LEN tokens, time to first token (TTFT) remains the same as if a full-length prompt was passed. However, a shorter prompt allows the model to generate more tokens within the available context. For example, if the input prompt is just 30 tokens, the model can generate up to \(1024 + 128 - 30 = 1122\) tokens.

OpenVINO 2025.3 has introduced dynamic input prompt support for NPU. The dynamism granularity is controlled by the new property NPUW_LLM_PREFILL_CHUNK_SIZE (default: 1024).

If the MAX_PROMPT_LEN property is set to a value greater than the chunk size, the mechanism is activated automatically. Set PREFILL_HINT to STATIC to disable this feature.
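As an illustration, here is a minimal Python sketch that raises MAX_PROMPT_LEN above the default chunk size (which activates chunked prefill automatically); it assumes NPUW_LLM_PREFILL_CHUNK_SIZE is passed through the pipeline config like the other NPU options shown below, and that ov_genai and model_path are defined as in the earlier snippets:

pipeline_config = {
    "MAX_PROMPT_LEN": 2048,               # larger than the chunk size, so chunked prefill activates
    "NPUW_LLM_PREFILL_CHUNK_SIZE": 1024,  # dynamism granularity (default: 1024)
}
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)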

Use the following code snippet to change the default settings:

pipeline_config = { "MAX_PROMPT_LEN": 1024, "MIN_RESPONSE_LEN": 512 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN",  1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);

In GenAI LLM chat scenarios, the conversation history accumulates in the context and may require a larger MAX_PROMPT_LEN to handle the history properly.
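For example, a minimal Python chat sketch that reserves a larger prompt budget for the growing history (the specific MAX_PROMPT_LEN and MIN_RESPONSE_LEN values and the prompts are illustrative assumptions):

pipeline_config = { "MAX_PROMPT_LEN": 2048, "MIN_RESPONSE_LEN": 256 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)

pipe.start_chat()                     # the pipeline accumulates the history internally
print(pipe.generate("What is an NPU?", max_new_tokens=128))
print(pipe.generate("How does it differ from a GPU?", max_new_tokens=128))
pipe.finish_chat()                    # clears the accumulated history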

Performance hints#

You can configure the NPU LLM pipeline using the PREFILL_HINT and GENERATE_HINT options to fine-tune performance. These options impact prompt processing (first token) and text generation (subsequent tokens) behavior, respectively.

PREFILL_HINT – fine-tunes the prompt processing stage:

  • DYNAMIC (default since OpenVINO 2025.3) – enables dynamic prompt execution, supports longer prompts.

  • STATIC – disables dynamic prompt execution, may provide better performance for specific prompt sizes. Default behavior before OpenVINO 2025.3.

GENERATE_HINT – fine-tunes the text generation stage:

  • FAST_COMPILE (default) – enables fast compilation at the expense of performance,

  • BEST_PERF – ensures the best possible performance at lower compilation speed.

Use the following code snippet:

pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "GENERATE_HINT",  "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
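PREFILL_HINT is passed the same way; for example, a short Python sketch that forces the pre-2025.3 static prefill behavior:

pipeline_config = { "PREFILL_HINT": "STATIC" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)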

Caching and ahead-of-time compilation#

LLM compilation for NPU happens on-the-fly and may take substantial time. To improve user experience, the following options are available: OpenVINO Caching and Ahead-of-time (AoT) compilation.

OpenVINO Caching#

By caching compiled models, you can reduce the initialization time for subsequent pipeline runs. To do so, specify one of the following options in pipeline_config for the NPU pipeline.

CACHE_DIR#

CACHE_DIR is the default OpenVINO caching mechanism. The CACHE_MODE hint defines how the cached blob stores weights. OPTIMIZE_SPEED includes the weights and allows faster loading for group-quantized models. OPTIMIZE_SIZE excludes the weights, producing a weightless blob, and requires the original model to be present on disk.

pipeline_config = { "CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "CACHE_DIR",  ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
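To choose how the cached blob stores weights, CACHE_MODE can be combined with CACHE_DIR; a minimal Python sketch using the values described above:

# OPTIMIZE_SPEED keeps the weights inside the cached blob; OPTIMIZE_SIZE produces a
# weightless blob and requires the original model files on disk when loading.
pipeline_config = { "CACHE_DIR": ".npucache", "CACHE_MODE": "OPTIMIZE_SPEED" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)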

NPUW_CACHE_DIR#

NPUW_CACHE_DIR is a legacy NPU-specific weightless caching option. Since OpenVINO 2025.1, the preferred device-neutral caching mechanism is the OpenVINO caching (CACHE_DIR).

Ahead-of-time compilation#

Specifying EXPORT_BLOB and BLOB_PATH parameters works similarly to CACHE_DIR but:

  • It allows you to explicitly specify where to store the compiled model.

  • For subsequent runs, it requires the same BLOB_PATH to import the compiled model.

  • Blob type is also defined by CACHE_MODE.

    • By default, OPTIMIZE_SIZE is used, producing a weightless blob. To load this blob, either the original weights file or an ov::Model object is required.

    • Pass OPTIMIZE_SPEED to export a blob with full weights.

  • If the blob is exported as weightless, you also need to provide either "WEIGHTS_PATH" : "path\\to\\original\\model.bin" or "MODEL_PTR" : the original ov::Model object in the config.

  • Ahead-of-time import in weightless mode has been optimized to consume less memory than regular compilation or CACHE_DIR-based caching.

The following snippets demonstrate the functionality:

pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "EXPORT_BLOB", "YES" }, { "BLOB_PATH", ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
pipeline_config = { "BLOB_PATH": ".npucache\\compiled_model.blob", "WEIGHTS_PATH": "path\\to\\original\\model.bin" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "BLOB_PATH", ".npucache\\compiled_model.blob" }, { "WEIGHTS_PATH", "path\\to\\original\\model.bin" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": ".npucache\\compiled_model.blob", "CACHE_MODE" : "OPTIMIZE_SPEED" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "EXPORT_BLOB", "YES" }, { "BLOB_PATH", ".npucache\\compiled_model.blob" }, { "CACHE_MODE", "OPTIMIZE_SPEED" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
pipeline_config = { "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "BLOB_PATH",  ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);

Blob encryption#

When exporting NPU LLM blobs, you can also specify encryption and decryption functions for the blob. For a weightless blob, the whole blob is encrypted; for a blob with weights, everything except the model weights is encrypted.

Export with encryption (C++):

ov::EncryptionCallbacks encryption_callbacks;
encryption_callbacks.encrypt = [](const std::string& s) { return s; };
ov::AnyMap pipeline_config = { { "EXPORT_BLOB", "YES" }, { "BLOB_PATH", ".npucache\\compiled_model.blob" }, { "CACHE_ENCRYPTION_CALLBACKS", encryption_callbacks } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);

Import with decryption (C++):

ov::EncryptionCallbacks encryption_callbacks;
encryption_callbacks.decrypt = [](const std::string& s) { return s; };
ov::AnyMap pipeline_config = { { "BLOB_PATH", ".npucache\\compiled_model.blob" }, { "WEIGHTS_PATH", "path\\to\\original\\model.bin" }, { "CACHE_ENCRYPTION_CALLBACKS", encryption_callbacks } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);

Whisper Inference on NPU#

OpenAI Whisper support (for whisper-tiny, whisper-base, whisper-small, or whisper-large models) was first introduced in OpenVINO 2024.5. There are no NPU-specific requirements when running the Whisper GenAI pipeline on NPU, so a standard OpenVINO GenAI sample works without any limitations.

Export Whisper models from Hugging Face#

Prior to OpenVINO 2025.1, the Whisper pipeline only accepted stateless Whisper models, exported with the --disable-stateful flag:

optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny whisper-tiny --disable-stateful

Since OpenVINO 2025.1, this is no longer required. Weights can remain in FP16 or be compressed in INT8:

optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base-int8 --weight-format int8
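After export, the pipeline runs on NPU like on any other device; a minimal Python sketch, assuming a 16 kHz mono recording loaded with librosa (the file name and exported model directory are placeholders):

import librosa
import openvino_genai as ov_genai

# Load the recording as 16 kHz mono float samples, the rate Whisper expects.
raw_speech, _ = librosa.load("sample.wav", sr=16000)

pipe = ov_genai.WhisperPipeline("whisper-base-int8", "NPU")
print(pipe.generate(raw_speech.tolist()))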

Troubleshooting#

In case of execution failures, either silent or with errors, try updating the NPU driver to 32.0.100.3104 or newer. If the update is not possible and you get “out of memory” errors, try setting the DISABLE_OPENVINO_GENAI_NPU_L0 environment variable to disable Level Zero memory allocation.

Set the environment variable in a terminal:

Linux:

export DISABLE_OPENVINO_GENAI_NPU_L0=1

Windows:

set DISABLE_OPENVINO_GENAI_NPU_L0=1

Additional Resources#