Inference with OpenVINO GenAI#
This guide provides additional details on running inference on NPU with OpenVINO GenAI. See the installation guide for information on how to get started.
Prerequisites#
Install required dependencies:
python -m venv npu-env
npu-env\Scripts\activate
pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Note that for systems based on Intel® Core™ Ultra Processors Series 2, more than 16GB of RAM may be required to run prompts over 1024 tokens on models with 7B or more parameters, such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.
Export an LLM model via Hugging Face Optimum-Intel#
Since symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU, make sure to export the model with the proper conversion and optimization settings.
You can select either group quantization or channel-wise quantization by setting the --group-size parameter to 128 or -1, respectively. See the following examples:
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
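The two export commands above differ only in the --group-size value. As a minimal sketch, the same command line could be assembled from Python as shown below. The helpers build_export_command and run_export, and their parameter names, are hypothetical and not part of Optimum-Intel; only flags that already appear in the commands above are used.

```python
import subprocess


def build_export_command(model_id, output_dir, channel_wise):
    """Assemble the optimum-cli call for a symmetric INT4 export.

    Hypothetical helper: it only reproduces the flags used in this guide.
    """
    # 128 selects group quantization, -1 selects channel-wise quantization.
    group_size = -1 if channel_wise else 128
    return [
        "optimum-cli", "export", "openvino",
        "-m", model_id,
        "--weight-format", "int4",
        "--sym",
        "--ratio", "1.0",
        "--group-size", str(group_size),
        output_dir,
    ]


def run_export(model_id, output_dir, channel_wise):
    """Run the export; requires the packages from the Prerequisites section."""
    subprocess.run(build_export_command(model_id, output_dir, channel_wise), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_export() to actually convert a model.
    cmd = build_export_command(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "TinyLlama-1.1B-Chat-v1.0", channel_wise=False
    )
    print(" ".join(cmd))
```

Passing the group size as a derived value rather than a free-form number makes it harder to type 1 where -1 is intended.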
If you want to improve accuracy, make sure you:
Update NNCF:
pip install nncf==2.13
Use --scale-estimation --dataset <dataset_name> and activation-aware weight quantization (--awq):
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Llama-2-7b-chat-hf
Important
Remember that the negative value of -1 is required here, not 1.
You can also try using 4-bit (INT4) GPTQ models, which do not require specifying quantization parameters:
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Llama-3.1-8B
microsoft/Phi-3-mini-4k-instruct
Qwen/Qwen2-7B
mistralai/Mistral-7B-Instruct-v0.2
openbmb/MiniCPM-1B-sft-bf16
TinyLlama/TinyLlama-1.1B-Chat-v1.0
TheBloke/Llama-2-7B-Chat-GPTQ
Qwen/Qwen2-7B-Instruct-GPTQ-Int4
Run generation using OpenVINO GenAI#
It is typically recommended to install the latest available driver.
Use the following code snippet to perform generation with OpenVINO GenAI API.
Note that currently, the NPU pipeline supports greedy decoding only. This means that you need to add do_sample=False to the generate() method:
import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100, do_sample=False))
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>
int main(int argc, char* argv[]) {
    std::string model_path = "TinyLlama";
    ov::genai::LLMPipeline pipe(model_path, "NPU");
    // The NPU pipeline supports greedy decoding only, so sampling is disabled.
    ov::genai::GenerationConfig config;
    config.do_sample = false;
    config.max_new_tokens = 100;
    std::cout << pipe.generate("The Sun is yellow because", config);
    return 0;
}
Additional configuration options#
Prompt and response length options#
The LLM pipeline for NPUs leverages the static shape approach, optimizing execution performance, while potentially introducing certain usage limitations. By default, the LLM pipeline supports input prompts up to 1024 tokens in length. It also ensures that the generated response contains at least 150 tokens, unless the generation encounters the end-of-sequence (EOS) token or the user explicitly sets a lower length limit for the response.
You may configure both the ‘maximum input prompt length’ and ‘minimum response length’ using the following parameters:
MAX_PROMPT_LEN - defines the maximum number of tokens that the LLM pipeline can process for the input prompt (default: 1024),
MIN_RESPONSE_LEN - defines the minimum number of tokens that the LLM pipeline will generate in its response (default: 150).
Use the following code snippet to change the default settings:
pipeline_config = { "MAX_PROMPT_LEN": 1024, "MIN_RESPONSE_LEN": 512 }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN", 1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
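Because the prompt limit is fixed when the pipeline is created, it can help to check a prompt's token count before calling generate(). The sketch below is one way to do that. make_length_config and prompt_fits are hypothetical helpers, the positive-value check is a sanity check of this sketch rather than documented pipeline behavior, and the token count is assumed to come from whichever tokenizer matches the model.

```python
def make_length_config(max_prompt_len=1024, min_response_len=150):
    """Return a pipeline_config dict holding the two length options.

    The defaults mirror the documented defaults (1024 and 150 tokens).
    """
    if max_prompt_len <= 0 or min_response_len <= 0:
        raise ValueError("both lengths must be positive token counts")
    return {"MAX_PROMPT_LEN": max_prompt_len, "MIN_RESPONSE_LEN": min_response_len}


def prompt_fits(num_prompt_tokens, config):
    """Return True if a prompt of this many tokens is within MAX_PROMPT_LEN."""
    return num_prompt_tokens <= config["MAX_PROMPT_LEN"]


if __name__ == "__main__":
    config = make_length_config(max_prompt_len=1024, min_response_len=512)
    # 1500 is a made-up token count standing in for a real tokenizer result.
    if not prompt_fits(1500, config):
        print("Prompt is too long: shorten it or raise MAX_PROMPT_LEN.")
```

The returned dictionary can be passed as pipeline_config in the same way as in the snippet above.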
Cache compiled models#
Specify the NPUW_CACHE_DIR option in pipeline_config for the NPU pipeline to cache the compiled models. As shown in the code snippet below, this shortens the initialization time of subsequent pipeline runs:
pipeline_config = { "NPUW_CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "NPUW_CACHE_DIR", ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
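To see the effect of caching on a given system, the initialization time can be measured on a first and a second run. The following is a minimal sketch: timed_init and make_npu_pipeline are hypothetical helper names, and the LLMPipeline call simply repeats the form used in the snippet above.

```python
import time


def timed_init(make_pipeline):
    """Call make_pipeline() and return (pipeline, elapsed seconds)."""
    start = time.perf_counter()
    pipe = make_pipeline()
    return pipe, time.perf_counter() - start


def make_npu_pipeline(model_path, cache_dir=".npucache"):
    """Create an NPU pipeline that caches compiled models in cache_dir."""
    # Imported lazily so that timed_init() can be used without OpenVINO GenAI installed.
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(model_path, "NPU", {"NPUW_CACHE_DIR": cache_dir})
```

For example, calling timed_init(lambda: make_npu_pipeline("TinyLlama")) in two separate runs of the same script would be expected to report a shorter time on the second run, once the cache directory has been populated.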
Disable memory allocation#
In case of execution failures, either silent or with errors, try to update the NPU driver to
32.0.100.3104 or newer.
If the update is not possible, set the DISABLE_OPENVINO_GENAI_NPU_L0
environment variable to disable NPU memory allocation, which might be supported
only on newer drivers for Intel Core Ultra 200V processors.
Set the environment variable in a terminal:
Linux:
export DISABLE_OPENVINO_GENAI_NPU_L0=1
Windows:
set DISABLE_OPENVINO_GENAI_NPU_L0=1
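If changing the terminal environment is inconvenient, the variable can also be set from within the Python script. This is a sketch under the assumption that the variable is read when the NPU pipeline is created, so it has to be set before that point. disable_npu_l0 is a hypothetical helper name.

```python
import os


def disable_npu_l0():
    """Set DISABLE_OPENVINO_GENAI_NPU_L0 for the current process.

    Assumption: the variable is read at pipeline creation, so call this
    before constructing the NPU pipeline.
    """
    os.environ["DISABLE_OPENVINO_GENAI_NPU_L0"] = "1"


disable_npu_l0()
# Create the NPU pipeline only after this point.
```

Unlike the terminal commands, this affects only the current process and any child processes it starts.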
Performance modes#
You can configure the NPU pipeline with the GENERATE_HINT option to switch between two different performance modes:
FAST_COMPILE (default) - enables fast compilation at the expense of performance,
BEST_PERF - ensures best possible performance at lower compilation speed.
Use the following code snippet:
pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
ov::AnyMap pipeline_config = { { "GENERATE_HINT", "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
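Since GENERATE_HINT accepts only two values, a small helper can catch typos before the pipeline starts compiling. The sketch below also merges the hint with other options from this guide. It assumes that these options can be combined in a single pipeline_config, which this guide does not state explicitly, and make_npu_config is a hypothetical helper name.

```python
VALID_HINTS = ("FAST_COMPILE", "BEST_PERF")


def make_npu_config(generate_hint="FAST_COMPILE", cache_dir=None, **extra):
    """Build a pipeline_config dict for the NPU pipeline.

    extra carries any further options, for example MAX_PROMPT_LEN=2048.
    """
    if generate_hint not in VALID_HINTS:
        raise ValueError(f"GENERATE_HINT must be one of {VALID_HINTS}, got {generate_hint!r}")
    config = {"GENERATE_HINT": generate_hint}
    if cache_dir is not None:
        config["NPUW_CACHE_DIR"] = cache_dir
    config.update(extra)
    return config


if __name__ == "__main__":
    print(make_npu_config("BEST_PERF", cache_dir=".npucache", MAX_PROMPT_LEN=2048))
```

The resulting dictionary is passed to LLMPipeline in the same way as pipeline_config in the snippet above.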