# Serving Text Generation with Visual Language Models using NPU Acceleration
This demo shows how to deploy VLM models in the OpenVINO Model Server with NPU acceleration.
From the client perspective it is very similar to the generative model deployment with continuous batching; likewise, it exposes the models via the OpenAI API `chat/completions` endpoint.
The difference is that it does not support request batching: requests can be sent concurrently, but they are processed sequentially.
It is targeted at client machines equipped with an NPU accelerator.

Note: This demo was tested on Meteor Lake, Lunar Lake, and Arrow Lake platforms on Windows 11 and Ubuntu 24.
## Prerequisites
- OVMS 2025.1
- Model preparation: Python 3.9 or higher with pip, and a HuggingFace account
- Model Server deployment: installed Docker Engine, or the OVMS binary package according to the baremetal deployment guide
- (Optional) Client: git and Python for using the OpenAI client package and the vLLM benchmark app
## Model preparation
In this step, the original PyTorch model and the tokenizer are converted to IR format and optionally quantized.
This ensures faster initialization time, better performance, and lower memory consumption.
The LLM engine parameters are defined inside the `graph.pbtxt` file.
Download the export script, install its dependencies, and create a directory for the models:
```bash
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/releases/2025/1/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/releases/2025/1/demos/common/export_models/requirements.txt
mkdir models
```
Run the `export_model.py` script to download and quantize the model:

Note: Users in China need to set the environment variable `HF_ENDPOINT="https://hf-mirror.com"` before running the export script to connect to the HF Hub.
```bash
python export_model.py text_generation --source_model microsoft/Phi-3.5-vision-instruct --target_device NPU --config_file_path models/config.json --model_repository_path models --overwrite_models
```
Note that by default, NPU limits the prompt length (which for VLMs also includes the image tokens) to 1024 tokens. You can modify that limit with the `--max_prompt_len` parameter (for example, `--max_prompt_len 2048`).

Note: You can replace the model used in the demo with any topology tested with OpenVINO.
You should have a model folder like below:
```
tree models
models
├── config.json
└── microsoft
    └── Phi-3.5-vision-instruct
        ├── config.json
        ├── generation_config.json
        ├── graph.pbtxt
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── tokenizer.json
```
The default configuration should work in most cases, but the parameters can be tuned via `export_model.py` script arguments. Run the script with the `--help` argument to check the available parameters, and see the LLM calculator documentation to learn more about the configuration options.
## Server Deployment
### Deploying with Docker
Running this command starts the container with NPU enabled:

```bash
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
  -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
```
### Deploying on Bare Metal
Assuming you have unpacked the model server package, make sure to:

- On Windows: run the `setupvars` script
- On Linux: set the `LD_LIBRARY_PATH` and `PATH` environment variables

as mentioned in the deployment guide, in every new shell that will start OpenVINO Model Server.
The model was exported for NPU in the first step of this demo (the target device is recorded in `config.json`). Make sure the appropriate NPU drivers are installed, so the device is accessible for the model server.
```bash
ovms --rest_port 8000 --config_path ./models/config.json
```
## Readiness Check
Wait for the model to load. You can check the status with a simple command:
```bash
curl http://localhost:8000/v1/config
```
```json
{
  "microsoft/Phi-3.5-vision-instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
```
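If you want to script the readiness check instead of polling by hand, a simple loop against the same endpoint works. Below is a minimal sketch; the endpoint and response shape are those shown above, while the timeout and polling interval are arbitrary choices:

```python
import time
import requests

def wait_until_ready(url="http://localhost:8000/v1/config", timeout_s=300):
    # Poll the config endpoint until the model reports the AVAILABLE state.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            config = requests.get(url, timeout=5).json()
            statuses = config["microsoft/Phi-3.5-vision-instruct"]["model_version_status"]
            if any(s["state"] == "AVAILABLE" for s in statuses):
                return True
        except (requests.RequestException, KeyError, ValueError):
            pass  # server not up yet, or the model is still loading
        time.sleep(2)
    return False

print("ready" if wait_until_ready() else "timed out")
```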
## Request Generation
Install the client dependency and download a sample image:

```bash
pip3 install requests
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/static/images/zebra.jpeg -o zebra.jpeg
```
### Unary call with Python requests library
```python
import base64
import requests

base_url = 'http://localhost:8000/v3'
model_name = "microsoft/Phi-3.5-vision-instruct"

def convert_image(image_path):
    # Read the image file and encode it as base64 for embedding in the payload
    with open(image_path, 'rb') as file:
        return base64.b64encode(file.read()).decode("utf-8")

payload = {
    "model": model_name,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is on the picture."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}}
            ]
        }
    ],
    "max_completion_tokens": 100
}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post(base_url + "/chat/completions", json=payload, headers=headers)
print(response.text)
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The picture features a zebra standing in a grassy plain. Zebras are known for their distinctive black and white striped patterns, which help them blend in for camouflage purposes. The zebra pictured is standing on a green field with patches of grass, indicating it may be in its natural habitat. Zebras are typically social animals and are often found in savannahs and grasslands.",
        "role": "assistant"
      }
    }
  ],
  "created": 1741731554,
  "model": "microsoft/Phi-3.5-vision-instruct",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 83,
    "total_tokens": 102
  }
}
```
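To extract just the generated text from the JSON above, index into the first choice. A minimal sketch, reusing the `response` object from the snippet above:

```python
# Parse the response body and print only the generated message text
result = response.json()
print(result["choices"][0]["message"]["content"])
```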
### Streaming call with OpenAI Python package

The `chat/completions` endpoint is compatible with the OpenAI client, so it can also be easily used in streaming mode.
Install the client library:
```bash
pip3 install openai
```
```python
import base64
from openai import OpenAI

base_url = 'http://localhost:8000/v3'
model_name = "microsoft/Phi-3.5-vision-instruct"
client = OpenAI(api_key='unused', base_url=base_url)

def convert_image(image_path):
    # Read the image file and encode it as base64 for embedding in the payload
    with open(image_path, 'rb') as file:
        return base64.b64encode(file.read()).decode("utf-8")

stream = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is on the picture."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}}
            ]
        }
    ],
    stream=True,
)
# Print tokens as they arrive from the server
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Output:

```text
The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain.
```
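The same endpoint also works without streaming. A minimal unary sketch, reusing `client`, `model_name`, and `convert_image` from the snippet above:

```python
# Unary (non-streaming) variant: the full response arrives in a single object
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is on the picture."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}}
            ]
        }
    ],
)
print(response.choices[0].message.content)
```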
## Benchmarking text generation
OpenVINO Model Server with NPU acceleration processes the requests sequentially. For that reason, benchmarking should be performed with `max_concurrency` set to 1. Parallel requests will be accepted, but they will wait for a free execution slot. Benchmarking can be demonstrated using the benchmarking app from the vLLM repository:
```bash
git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
cd benchmarks
curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json  # sample dataset
python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model microsoft/Phi-3.5-vision-instruct --endpoint /v3/chat/completions --num-prompts 10 --trust-remote-code --max-concurrency 1
```
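You can also observe the sequential processing directly: if two requests are sent at the same time, the total wall time is roughly the sum of the individual generation times rather than their maximum. Below is a minimal sketch, assuming the server from this demo is running on `localhost:8000` and `zebra.jpeg` is in the working directory:

```python
import base64
import threading
import time
import requests

URL = "http://localhost:8000/v3/chat/completions"

with open("zebra.jpeg", "rb") as f:
    IMAGE_B64 = base64.b64encode(f.read()).decode("utf-8")

def ask(prompt):
    # Same request format as in the earlier examples; each request carries one image
    payload = {
        "model": "microsoft/Phi-3.5-vision-instruct",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{IMAGE_B64}"}},
        ]}],
        "max_completion_tokens": 50,
    }
    start = time.time()
    requests.post(URL, json=payload)
    print(f"{prompt!r} finished after {time.time() - start:.1f}s")

# Two concurrent requests: both are accepted immediately,
# but the second one waits for a free execution slot.
threads = [threading.Thread(target=ask, args=(p,))
           for p in ("Describe the animal.", "Describe the background.")]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total wall time: {time.time() - start:.1f}s")
```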
## Testing the model accuracy over serving API
Check the guide on using lm-evaluation-harness.
## Limitations
- Requests MUST include one and only one image in the messages context; any other request will be rejected (a quick way to observe this is shown in the sketch after this list).
- The `beam_search` algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
- Models must be exported with INT4 precision and the `--sym --ratio 1.0 --group-size -1` parameters. This is enforced in the `export_model.py` script when the target_device is NPU.
- `log_probs` are not supported.
- The finish reason is always set to "stop".
- Only a single response can be returned. The parameter `n` is not supported.
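As an illustration of the first limitation, the following sketch sends a text-only request and prints the server's rejection. It assumes the server from this demo is running on `localhost:8000`; the exact status code and error message are server-defined:

```python
import requests

# Text-only request: it violates the one-image requirement, so the
# server is expected to reject it rather than generate a completion.
payload = {
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe what is on the picture."},
    ]}],
    "max_completion_tokens": 50,
}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
print(response.status_code)
print(response.text)  # rejection details, as returned by the server
```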