Run LLMs with OpenVINO GenAI Flavor on NPU#

This guide provides additional details on how to run LLMs on NPU with the GenAI flavor. See the installation guide for information on how to get started.

Prerequisites#

Install required dependencies:

python -m venv npu-env
npu-env\Scripts\activate
pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
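
Before exporting a model, you may want to confirm that OpenVINO can see the NPU device at all. A minimal check (the device shows up as "NPU" only when the driver is installed correctly):

import openvino as ov

core = ov.Core()
# Lists all devices visible to OpenVINO, e.g. ['CPU', 'GPU', 'NPU'];
# "NPU" appears only if the NPU driver is installed and working.
print(core.available_devices)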

Export an LLM model via Hugging Face Optimum-Intel#

Since symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU, make sure to export the model with the proper conversion and optimization settings.

You may export LLMs via Optimum-Intel, using one of two compression methods:

  • group quantization - suitable for both smaller and larger models,

  • channel-wise quantization - remarkably effective, but applicable only to models exceeding 1 billion parameters.

You select one of the two methods by setting the --group-size parameter to either 128 or -1, respectively. See the following examples:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
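
After export, a quick way to verify the resulting model directory is to load it with the GenAI pipeline on CPU before targeting NPU. This is only a smoke-test sketch, not a required step:

import openvino_genai as ov_genai

# Load the exported IR directory on CPU to check that conversion succeeded
pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
print(pipe.generate("Hello,", max_new_tokens=10, do_sample=False))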

If you want to improve accuracy, make sure you:

  1. Update NNCF: pip install nncf==2.13

  2. Use --scale-estimation --dataset=<dataset_name> and activation-aware weight quantization (--awq):

    optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset=wikitext2 Llama-2-7b-chat-hf

Important

Remember that --group-size requires the negative value -1 here, not 1.

You can also try using 4-bit (INT4) GPTQ models, which do not require specifying quantization parameters:

optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ

Remember that the NPU supports GenAI models quantized symmetrically to INT4. Below is a list of such models:
  • meta-llama/Meta-Llama-3-8B-Instruct

  • microsoft/Phi-3-mini-4k-instruct

  • Qwen/Qwen2-7B

  • mistralai/Mistral-7B-Instruct-v0.2

  • openbmb/MiniCPM-1B-sft-bf16

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0

  • TheBloke/Llama-2-7B-Chat-GPTQ

  • Qwen/Qwen2-7B-Instruct-GPTQ-Int4

Run generation using OpenVINO GenAI#

It is typically recommended to install the latest available driver.

Use the following code snippet to perform generation with the OpenVINO GenAI API. Note that currently, the NPU pipeline supports greedy decoding only, so you need to pass do_sample=False to the generate() method:

import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100, do_sample=False))
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
   std::string model_path = "TinyLlama";
   ov::genai::GenerationConfig config;
   config.do_sample=false;
   config.max_new_tokens=100;
   std::cout << pipe.generate("The Sun is yellow because", config);
}
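
For interactive applications, you can also stream tokens as they are produced, using the streamer callback of the GenAI Python API. The sketch below assumes the TinyLlama model exported earlier:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")

def streamer(subword):
    # Called for each new piece of decoded text; print it immediately
    print(subword, end="", flush=True)
    return False  # returning False continues generation

pipe.generate("The Sun is yellow because", max_new_tokens=100, do_sample=False, streamer=streamer)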

Additional configuration options#

Prompt and response length options#

The LLM pipeline for NPUs leverages the static shape approach, optimizing execution performance, while potentially introducing certain usage limitations. By default, the LLM pipeline supports input prompts up to 1024 tokens in length. It also ensures that the generated response contains at least 150 tokens, unless the generation encounters the end-of-sequence (EOS) token or the user explicitly sets a lower length limit for the response.

You may configure both the ‘maximum input prompt length’ and ‘minimum response length’ using the following parameters:

  • MAX_PROMPT_LEN - defines the maximum number of tokens that the LLM pipeline can process for the input prompt (default: 1024),

  • MIN_RESPONSE_LEN - defines the minimum number of tokens that the LLM pipeline will generate in its response (default: 150).

Use the following code snippet to change the default settings:

pipeline_config = { "MAX_PROMPT_LEN": 1024, "MIN_RESPONSE_LEN": 512 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN",  1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
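
If you are unsure whether a prompt fits within MAX_PROMPT_LEN, you can count its tokens with the pipeline's own tokenizer first. A sketch assuming the GenAI Tokenizer API (get_tokenizer() and encode()):

pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
tokenizer = pipe.get_tokenizer()
# encode() returns tokenized inputs; the last dimension of input_ids
# is the prompt length in tokens
tokens = tokenizer.encode("The Sun is yellow because")
print(tokens.input_ids.shape)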

Cache compiled models#

Specify the NPUW_CACHE_DIR option in pipeline_config for the NPU pipeline to cache the compiled models. Caching, as in the snippet below, shortens the initialization time of subsequent pipeline runs:

pipeline_config = { "NPUW_CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "NPUW_CACHE_DIR",  ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
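
To see the effect of the cache, you can time pipeline construction across runs: the first run compiles the model and fills .npucache, while subsequent runs load from it. A rough illustration (exact timings depend on the model and the machine):

import time
import openvino_genai as ov_genai

model_path = "TinyLlama"
pipeline_config = { "NPUW_CACHE_DIR": ".npucache" }

start = time.perf_counter()
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
print(f"Pipeline ready in {time.perf_counter() - start:.1f} s")
# Re-running the script should report a noticeably shorter time,
# as the compiled model is loaded from the cache instead of recompiled.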

Disable memory allocation#

In case of execution failures, either silent or with errors, try updating the NPU driver to 32.0.100.3104 or newer. If the update is not possible, set the DISABLE_OPENVINO_GENAI_NPU_L0 environment variable to disable NPU memory allocation, which might be supported only by newer drivers for Intel Core Ultra 200V processors.

Set the environment variable in a terminal (export on Linux, set on Windows):

export DISABLE_OPENVINO_GENAI_NPU_L0=1
set DISABLE_OPENVINO_GENAI_NPU_L0=1
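
If you cannot control the launching shell, the variable can also be set from Python. This sketch assumes that setting it before the NPU pipeline is created is early enough; verify this on your driver version:

import os

# Assumption: the variable is read when the NPU pipeline is constructed,
# so it must be set before creating the pipeline.
os.environ["DISABLE_OPENVINO_GENAI_NPU_L0"] = "1"

import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")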

Performance modes#

You can configure the NPU pipeline with the GENERATE_HINT option to switch between two different performance modes:

  • FAST_COMPILE (default) - enables fast compilation at the expense of performance,

  • BEST_PERF - ensures best possible performance at lower compilation speed.

Use the following code snippet:

pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "GENERATE_HINT",  "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);

Additional Resources#