How to serve VLM models with Continuous Batching via OpenAI API#

This demo shows how to deploy Vision Language Models in the OpenVINO Model Server using continuous batching and paged attention algorithms. The text generation use case is exposed via the OpenAI API chat/completions endpoint, which makes it easy to use and efficient, especially on Intel® Xeon® processors and Intel® Arc™ GPUs.

Note: This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu 22/24, RedHat 8/9 and Windows 11.

Prerequisites#

OVMS version 2025.1: this demo requires version 2025.1. Until it is published, OVMS should be built from source.

Model preparation: Python 3.9 or higher with pip, and a HuggingFace account

Model Server deployment: Installed Docker Engine or OVMS binary package according to the baremetal deployment guide

(Optional) Client: git and Python for using OpenAI client package and vLLM benchmark app

Model preparation#

Here, the original VLM model and its auxiliary models (tokenizer, vision encoder, embeddings model, etc.) will be converted to IR format and optionally quantized. This ensures faster initialization, better performance and lower memory consumption. Execution parameters will be defined inside the graph.pbtxt file.

Download the export script, install its dependencies and create a directory for the models:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/export_models/requirements.txt
mkdir models

Run export_model.py script to download and quantize the model:

Note: Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub.

CPU

python export_model.py text_generation --source_model OpenGVLab/InternVL2_5-8B --weight-format int8 --config_file_path models/config.json --model_repository_path models  --overwrite_models

GPU

python export_model.py text_generation --source_model OpenGVLab/InternVL2_5-8B --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models

Note: Change the --weight-format to quantize the model to int8 or int4 precision to reduce memory consumption and improve performance.

Note: You can change the model used in the demo to any of the topologies tested with OpenVINO.

You should have a model folder like below:

models/
├── config.json
└── OpenGVLab
    └── InternVL2_5-8B
        ├── added_tokens.json
        ├── config.json
        ├── configuration_internlm2.py
        ├── configuration_intern_vit.py
        ├── configuration_internvl_chat.py
        ├── generation_config.json
        ├── graph.pbtxt
        ├── openvino_config.json
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_language_model.bin
        ├── openvino_language_model.xml
        ├── openvino_text_embeddings_model.bin
        ├── openvino_text_embeddings_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── openvino_vision_embeddings_model.bin
        ├── openvino_vision_embeddings_model.xml
        ├── preprocessor_config.json
        ├── special_tokens_map.json
        ├── tokenization_internlm2.py
        ├── tokenizer_config.json
        └── tokenizer.model

The default configuration should work in most cases, but the parameters can be tuned via export_model.py script arguments. Run the script with the --help argument to check the available parameters and see the LLM calculator documentation to learn more about the configuration options.
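To quickly verify what the export produced, you can inspect the generated configuration and list the exported files with a short Python sketch. The paths below assume the default --model_repository_path and --config_file_path used in this demo; the exact file list depends on the export_model.py version and parameters:

import json
from pathlib import Path

models_dir = Path("models")

# Print the model server configuration generated by export_model.py
with open(models_dir / "config.json") as f:
    print(json.dumps(json.load(f), indent=2))

# List the files exported for the model (tokenizer, embeddings, language model, graph.pbtxt, ...)
model_dir = models_dir / "OpenGVLab" / "InternVL2_5-8B"
for path in sorted(model_dir.iterdir()):
    print(path.name)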

Server Deployment#

Deploying with Docker

Select deployment option depending on how you prepared models in the previous step.

CPU

Running this command starts the container with the CPU as the target device:

docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json

GPU

If you want to use a GPU device to run the generation, add the extra Docker parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) to the docker run command and use the image with GPU support. Export the models with a precision matching the GPU capacity and adjust the pipeline configuration. This can be applied using the command below:

docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
Deploying on Bare Metal

Assuming you have unpacked the model server package, make sure to:

  • On Windows: run setupvars script

  • On Linux: set LD_LIBRARY_PATH and PATH environment variables

as mentioned in the deployment guide, in every new shell that will start OpenVINO Model Server.

Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (it is defined in graph.pbtxt). If you run on GPU, make sure the appropriate drivers are installed, so the device is accessible for the model server.
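If you want to confirm which devices OpenVINO detects before starting the server, a minimal, optional check can list them. It assumes the openvino Python package is available in the current environment (pip install openvino if needed):

import openvino as ov

# List the devices OpenVINO detects on this machine, e.g. ['CPU', 'GPU']
core = ov.Core()
print(core.available_devices)

Then start the server: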

ovms --rest_port 8000 --config_path ./models/config.json

Readiness Check#

Wait for the model to load. You can check the status with a simple command:

curl http://localhost:8000/v1/config
{
    "OpenGVLab/InternVL2_5-8B": {
        "model_version_status": [
            {
                "version": "1",
                "state": "AVAILABLE",
                "status": {
                    "error_code": "OK",
                    "error_message": "OK"
                }
            }
        ]
    }
}
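Instead of polling manually, a small Python sketch can wait until the model reports the AVAILABLE state. It assumes the REST port 8000 and the model name used in this demo:

import time
import requests

base_url = "http://localhost:8000"
model_name = "OpenGVLab/InternVL2_5-8B"

# Poll the config endpoint until the model version reports AVAILABLE
for _ in range(120):
    try:
        config = requests.get(f"{base_url}/v1/config", timeout=5).json()
        statuses = config.get(model_name, {}).get("model_version_status", [])
        if any(s.get("state") == "AVAILABLE" for s in statuses):
            print("Model is ready")
            break
    except requests.RequestException:
        pass  # the server may not be up yet
    time.sleep(5)
else:
    raise TimeoutError("Model did not become AVAILABLE in time")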

Request Generation#

Let's send a request with text and an image (the zebra.jpeg picture downloaded below) in the messages context.

Unary call with the Python requests library
pip3 install requests
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/static/images/zebra.jpeg -o zebra.jpeg
import base64
import requests

# REST endpoint of the model server started in the previous step
base_url = 'http://localhost:8000/v3'
model_name = "OpenGVLab/InternVL2_5-8B"

def convert_image(image_path):
    # Encode the image file as base64 so it can be embedded in the request
    with open(image_path, 'rb') as file:
        base64_image = base64.b64encode(file.read()).decode("utf-8")
    return base64_image

payload = {"model": model_name,
    "messages": [
        {
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe what is on the picture."},
              {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}}
            ]
        }
        ],
    "max_completion_tokens": 100
}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post(base_url + "/chat/completions", json=payload, headers=headers)
print(response.text)
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The picture features a zebra standing in a grassy plain. Zebras are known for their distinctive black and white striped patterns, which help them blend in for camouflage purposes. The zebra pictured is standing on a green field with patches of grass, indicating it may be in its natural habitat. Zebras are typically social animals and are often found in savannahs and grasslands.",
        "role": "assistant"
      }
    }
  ],
  "created": 1741731554,
  "model": "OpenGVLab/InternVL2_5-8B",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 83,
    "total_tokens": 102
  }
}
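Continuing the example above, the generated text and the token usage can be read from the parsed JSON response (reusing the response object from the previous snippet):

# Extract the generated text and token usage from the unary response
data = response.json()
answer = data["choices"][0]["message"]["content"]
usage = data["usage"]
print(answer)
print(f"prompt: {usage['prompt_tokens']} tokens, completion: {usage['completion_tokens']} tokens")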
Streaming request with OpenAI client

The chat/completions endpoint is compatible with the OpenAI client, so it can also be easily used in streaming mode:

Install the client library:

pip3 install openai
from openai import OpenAI
import base64

# REST endpoint of the model server started in the previous step
base_url = 'http://localhost:8000/v3'
model_name = "OpenGVLab/InternVL2_5-8B"

client = OpenAI(api_key='unused', base_url=base_url)

def convert_image(image_path):
    # Encode the image file as base64 so it can be embedded in the request
    with open(image_path, 'rb') as file:
        base64_image = base64.b64encode(file.read()).decode("utf-8")
    return base64_image

stream = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe what is on the picture."},
              {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}}
            ]
        }
        ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Output:

The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain.

Benchmarking text generation with high concurrency#

OpenVINO Model Server employs efficient parallelization for text generation. It can serve text generation with high concurrency in an environment shared by multiple clients. This can be demonstrated using the benchmarking app from the vLLM repository:

git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
cd benchmarks
python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenGVLab/InternVL2_5-8B --endpoint /v3/chat/completions  --request-rate inf --num-prompts 100 --trust-remote-code

Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  328.26
Total input tokens:                      15381
Total generated tokens:                  5231
Request throughput (req/s):              0.30
Output token throughput (tok/s):         15.94
Total Token throughput (tok/s):          62.79
---------------Time to First Token----------------
Mean TTFT (ms):                          166209.84
Median TTFT (ms):                        185027.54
P99 TTFT (ms):                           317956.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          217.28
Median TPOT (ms):                        243.84
P99 TPOT (ms):                           473.48
---------------Inter-token Latency----------------
Mean ITL (ms):                           394.46
Median ITL (ms):                         311.98
P99 ITL (ms):                            1500.04
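For a quick, dependency-light way to observe continuous batching without cloning vLLM, a minimal sketch with Python's concurrent.futures can send several chat requests in parallel to the same endpoint. It reuses the zebra.jpeg image, REST port and model name from this demo:

import base64
import time
from concurrent.futures import ThreadPoolExecutor

import requests

base_url = "http://localhost:8000/v3"
model_name = "OpenGVLab/InternVL2_5-8B"

with open("zebra.jpeg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": model_name,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is on the picture."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_completion_tokens": 100,
}

def send_request(i):
    # Each worker sends the same multimodal request and reports its latency
    start = time.time()
    response = requests.post(f"{base_url}/chat/completions", json=payload, timeout=600)
    response.raise_for_status()
    tokens = response.json()["usage"]["completion_tokens"]
    return i, tokens, time.time() - start

# Send 8 identical requests concurrently; continuous batching serves them in parallel
start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    for i, tokens, latency in pool.map(send_request, range(8)):
        print(f"request {i}: {tokens} tokens in {latency:.1f}s")
print(f"total wall time: {time.time() - start:.1f}s")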

RAG with visual language model and Model Server#

TBD

Testing the model accuracy over serving API#

Check the guide on using lm-evaluation-harness

References#