NPU with OpenVINO GenAI#
This guide will give you extra details on how to use NPU with OpenVINO GenAI. See the installation guide for information on how to start.
Prerequisites#
Install required dependencies:
Linux:
python3 -m venv npu-env
source npu-env/bin/activate
pip install nncf==2.14.1 onnx==1.17.0 optimum-intel==1.21.0
pip install openvino==2025.0 openvino-tokenizers==2025.0 openvino-genai==2025.0
For the pre-production version, use the following command instead:
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Windows:
python -m venv npu-env
npu-env\Scripts\activate
pip install nncf==2.14.1 onnx==1.17.0 optimum-intel==1.21.0
pip install openvino==2025.0 openvino-tokenizers==2025.0 openvino-genai==2025.0
For the pre-production version, use the following command instead:
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
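After installation, you can check that the NPU device is visible to OpenVINO. The snippet below is a minimal sketch; whether "NPU" appears in the list depends on your hardware and driver:
import openvino as ov

# List the devices OpenVINO can use; "NPU" should appear if the driver is set up correctly.
core = ov.Core()
print(core.available_devices)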
Note that for systems based on Intel® Core™ Ultra Processors Series 2, more than 16GB of RAM may be required to run prompts over 1024 tokens on models exceeding 7B parameters, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.
Make sure your model works with NPU. Some models may not be supported, for example, the FLUX.1 pipeline is currently not supported by the device.
Currently, the Whisper pipeline (using whisper-tiny, whisper-base, whisper-small, or whisper-large) only accepts models generated with the --disable-stateful flag.
Here is a conversion example:
optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny whisper-tiny --disable-stateful
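Once exported, the model can be run with the Whisper pipeline on NPU. Below is a minimal sketch; the audio file name and the use of librosa to load 16 kHz mono audio are assumptions, and any source of 16 kHz float samples will do:
import librosa  # assumed helper for reading audio; not part of the OpenVINO packages
import openvino_genai as ov_genai

# WhisperPipeline expects raw 16 kHz speech samples rather than a file path.
raw_speech, _ = librosa.load("sample.wav", sr=16000)

pipe = ov_genai.WhisperPipeline("whisper-tiny", "NPU")
print(pipe.generate(raw_speech.tolist()))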
Export an LLM model via Hugging Face Optimum-Intel#
Since symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU, make sure to export the model with the proper conversion and optimization settings.
You can use one of two compression methods: group quantization or channel-wise quantization. Select one by setting the --group-size parameter to 128 or -1, respectively. See the following examples.
For channel-wise quantization, set --group-size to -1:
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
If you want to improve accuracy, make sure you:
Use NNCF 2.13 or newer (the version installed in Prerequisites already satisfies this):
pip install "nncf>=2.13"
Use --scale-estimation --dataset <dataset_name> together with AWQ (--awq):
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Llama-2-7b-chat-hf
Important
Remember that the negative value of -1 is required here, not 1.
For group quantization, set --group-size to 128:
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
You can also try using 4-bit (INT4) GPTQ models, which do not require specifying quantization parameters:
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ
The following models have been tested on NPU:
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Llama-3.1-8B
microsoft/Phi-3-mini-4k-instruct
Qwen/Qwen2-7B
mistralai/Mistral-7B-Instruct-v0.2
openbmb/MiniCPM-1B-sft-bf16
TinyLlama/TinyLlama-1.1B-Chat-v1.0
TheBloke/Llama-2-7B-Chat-GPTQ
Qwen/Qwen2-7B-Instruct-GPTQ-Int4
Run generation using OpenVINO GenAI#
It is recommended to install the latest available NPU driver.
Use the following code snippet to perform generation with the OpenVINO GenAI API:
import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = "TinyLlama";
    ov::genai::LLMPipeline pipe(model_path, "NPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    std::cout << pipe.generate("The Sun is yellow because", config);
}
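To print tokens as they are produced instead of waiting for the whole response, you can pass a streamer callback to generate(). A minimal Python sketch, with an illustrative prompt and token limit:
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")

# The callback receives decoded text chunks as they become available.
def streamer(subword):
    print(subword, end="", flush=True)
    return False  # False means "continue generation"

pipe.generate("The Sun is yellow because", max_new_tokens=100, streamer=streamer)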
Additional configuration options#
Prompt and response length options#
The LLM pipeline for NPUs leverages the static shape approach, optimizing execution performance, while potentially introducing certain usage limitations. By default, the LLM pipeline supports input prompts up to 1024 tokens in length. It also ensures that the generated response contains at least 150 tokens, unless the generation encounters the end-of-sequence (EOS) token or the user explicitly sets a lower length limit for the response.
You may configure both the ‘maximum input prompt length’ and ‘minimum response length’ using the following parameters:
MAX_PROMPT_LEN - defines the maximum number of tokens that the LLM pipeline can process for the input prompt (default: 1024),
MIN_RESPONSE_LEN - defines the minimum number of tokens that the LLM pipeline will generate in its response (default: 150).
Use the following code snippet to change the default settings:
pipeline_config = { "MAX_PROMPT_LEN": 1024, "MIN_RESPONSE_LEN": 512 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN", 1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
Cache compiled models#
By caching compiled models, you may shorten the initialization time of future pipeline runs. To do so, specify one of the following options in pipeline_config for the NPU pipeline.
NPUW_CACHE_DIR#
NPUW_CACHE_DIR is the most basic option: it caches compiled subgraphs without weights and reuses them for future pipeline runs.
CACHE_DIR#
CACHE_DIR operates similarly to the older NPUW_CACHE_DIR, except for two differences:
It creates a single ".blob" file and loads it faster.
It stores all model weights inside the blob, making it much bigger than the individual compiled schedules for the model's subgraphs stored by NPUW_CACHE_DIR.
pipeline_config = { "CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "CACHE_DIR", ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
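The effect of caching is easy to observe by timing pipeline construction: the first run compiles the model and fills the cache, while later runs load the cached blob. A minimal sketch (timings depend on the model and hardware):
import time
import openvino_genai as ov_genai

pipeline_config = { "CACHE_DIR": ".npucache" }

start = time.perf_counter()
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)
print(f"Pipeline ready in {time.perf_counter() - start:.1f} s")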
‘Ahead of time’ compilation#
Specifying the EXPORT_BLOB and BLOB_PATH parameters works similarly to CACHE_DIR, but:
It allows you to explicitly specify where to store the compiled model.
For subsequent runs, it requires the same BLOB_PATH to import the compiled model.
To export the compiled model:
pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "EXPORT_BLOB", "YES" }, { "BLOB_PATH", ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
pipeline_config = { "BLOB_PATH": ".npucache\\compiled_model.blob" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "BLOB_PATH", ".npucache\\compiled_model.blob" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
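A common pattern is to export the blob on the first run and import it afterwards. The sketch below switches between the two configurations depending on whether the blob file already exists; the paths are illustrative:
import os
import openvino_genai as ov_genai

blob_path = os.path.join(".npucache", "compiled_model.blob")

if os.path.exists(blob_path):
    # Import the previously compiled model.
    pipeline_config = { "BLOB_PATH": blob_path }
else:
    # Compile the model and store the blob for future runs.
    os.makedirs(".npucache", exist_ok=True)
    pipeline_config = { "EXPORT_BLOB": "YES", "BLOB_PATH": blob_path }

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)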
Disable memory allocation#
In case of execution failures, either silent or with errors, try to update the NPU driver to 32.0.100.3104 or newer. If the update is not possible, set the DISABLE_OPENVINO_GENAI_NPU_L0 environment variable to disable NPU memory allocation, which might be supported only on newer drivers for Intel Core Ultra 200V processors.
Set the environment variable in a terminal:
Linux:
export DISABLE_OPENVINO_GENAI_NPU_L0=1
Windows:
set DISABLE_OPENVINO_GENAI_NPU_L0=1
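You can also set the variable from Python before the pipeline is created, assuming it is read at pipeline construction time:
import os
import openvino_genai as ov_genai

# Must be set before the NPU pipeline is constructed.
os.environ["DISABLE_OPENVINO_GENAI_NPU_L0"] = "1"

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")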
Performance modes#
You can configure the NPU pipeline with the GENERATE_HINT option to switch between two different performance modes:
FAST_COMPILE (default) - enables fast compilation at the expense of performance,
BEST_PERF - ensures best possible performance at lower compilation speed.
Use the following code snippet:
pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "GENERATE_HINT", "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
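GENERATE_HINT can be combined with the other options described above in a single pipeline_config. For example, the following sketch requests best performance together with a longer prompt limit and model caching; the specific values are illustrative:
import openvino_genai as ov_genai

pipeline_config = {
    "GENERATE_HINT": "BEST_PERF",
    "MAX_PROMPT_LEN": 2048,
    "CACHE_DIR": ".npucache",
}
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU", pipeline_config)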