How to serve Embeddings models via OpenAI API#

This demo shows how to deploy embeddings models in the OpenVINO Model Server for text feature extraction. The use case is exposed via the OpenAI API embeddings endpoint.

Prerequisites#

Model preparation: Python 3.9 or higher with pip

Model Server deployment: Installed Docker Engine or OVMS binary package according to the baremetal deployment guide

(Optional) Client: Python with pip

Model preparation#

Here, the original PyTorch model and the tokenizer will be converted to IR format and optionally quantized. This ensures faster initialization time, better performance and lower memory consumption.

Download the export script, install its dependencies and create a directory for the models:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/requirements.txt
mkdir models 

Run export_model.py script to download and quantize the model:

CPU

python export_model.py embeddings_ov --source_model nomic-ai/nomic-embed-text-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling CLS --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model BAAI/bge-large-en-v1.5 --pooling CLS --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model BAAI/bge-large-zh-v1.5 --pooling CLS --weight-format int8 --config_file_path models/config.json --model_repository_path models
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest --pull --model_repository_path /models --source_model OpenVINO/bge-base-en-v1.5-int8-ov --pooling CLS --task embeddings
python export_model.py embeddings_ov --source_model thenlper/gte-small --pooling CLS --weight-format int8 --config_file_path models/config.json --model_repository_path models
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest --pull --model_repository_path /models --source_model OpenVINO/Qwen3-Embedding-0.6B-int8-ov --pooling LAST --task embeddings
python export_model.py embeddings_ov --source_model sentence-transformers/all-MiniLM-L12-v2 --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model sentence-transformers/all-distilroberta-v1 --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model mixedbread-ai/deepset-mxbai-embed-de-large-v1 --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-large-instruct --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-large --pooling MEAN --weight-format int8 --config_file_path models/config.json --model_repository_path models

GPU

python export_model.py embeddings_ov --source_model nomic-ai/nomic-embed-text-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --extra_quantization_params "--library sentence_transformers" --pooling CLS --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model BAAI/bge-large-en-v1.5 --pooling CLS --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model BAAI/bge-large-zh-v1.5 --pooling CLS --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest --pull --model_repository_path /models --source_model OpenVINO/bge-base-en-v1.5-int8-ov --pooling CLS --target_device GPU --task embeddings
python export_model.py embeddings_ov --source_model thenlper/gte-small --pooling CLS --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest --pull --model_repository_path /models --source_model OpenVINO/Qwen3-Embedding-0.6B-int8-ov --pooling LAST --target_device GPU --task embeddings
python export_model.py embeddings_ov --source_model sentence-transformers/all-MiniLM-L12-v2 --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model sentence-transformers/all-distilroberta-v1 --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model mixedbread-ai/deepset-mxbai-embed-de-large-v1 --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-large-instruct --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model intfloat/multilingual-e5-large --pooling MEAN --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models

Note: Change --weight-format to quantize the model to fp16, int8 or int4 precision, which reduces memory consumption and can improve performance.

Note: Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub.

You should have a model folder like below:

tree models
models
├── BAAI
│   └── bge-large-en-v1.5
│       ├── config.json
│       ├── graph.pbtxt
│       ├── openvino_model.bin
│       ├── openvino_model.xml
│       ├── openvino_tokenizer.bin
│       ├── openvino_tokenizer.xml
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       └── vocab.txt
└── config.json

The default configuration of the EmbeddingsCalculatorOV should work in most cases, but the parameters can be tuned in the node_options section of the graph.pbtxt file. They can also be set automatically via export parameters in the export_model.py script.

For example: python export_model.py embeddings_ov --source_model BAAI/bge-large-en-v1.5 --weight-format int8 --skip_normalize --config_file_path models/config.json
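
For reference, the relevant node_options fragment of a generated graph.pbtxt looks roughly like the sketch below. This is illustrative only; field names and defaults may differ between releases, so check the file produced by export_model.py for your version:

node_options: {
    [type.googleapis.com / mediapipe.EmbeddingsCalculatorOVOptions]: {
        models_path: "./"           # location of the OpenVINO model files
        normalize_embeddings: true  # assumed to correspond to the --skip_normalize export flag
    }
}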

Note: By default OVMS returns the first token embedding as the sequence embedding (called CLS pooling). This can be changed with the --pooling option if the model requires it. Supported values are CLS, MEAN and LAST. For example:

python export_model.py embeddings_ov --source_model Qwen/Qwen3-Embedding-0.6B --weight-format fp16 --pooling LAST --config_file_path models/config.json

Tested models#

All models supported by optimum-intel should be compatible. The following Hugging Face models are included in serving validation:

Model name                                     Pooling
---------------------------------------------  -------
nomic-ai/nomic-embed-text-v1.5                 MEAN
Alibaba-NLP/gte-large-en-v1.5                  CLS
BAAI/bge-large-en-v1.5                         CLS
BAAI/bge-large-zh-v1.5                         CLS
OpenVINO/bge-base-en-v1.5-int8-ov              CLS
thenlper/gte-small                             CLS
Qwen/Qwen3-Embedding-0.6B                      LAST
sentence-transformers/all-MiniLM-L12-v2        MEAN
sentence-transformers/all-distilroberta-v1     MEAN
mixedbread-ai/deepset-mxbai-embed-de-large-v1  MEAN
intfloat/multilingual-e5-large-instruct        MEAN
intfloat/multilingual-e5-large                 MEAN

Server Deployment#

Deploying with Docker

CPU

docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json

GPU

In case you want to use a GPU device to run the embeddings model, add the extra docker parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) to the docker run command, use the image with GPU support, and make sure to set target_device in subconfig.json to GPU. Also make sure the exported model quantization level and cache size fit in the GPU memory. All of that can be applied with the commands:

docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json

Deploying on Bare Metal

Assuming you have unpacked the model server package, make sure to:

  • On Windows: run setupvars script

  • On Linux: set LD_LIBRARY_PATH and PATH environment variables

as mentioned in the deployment guide, in every new shell that will start OpenVINO Model Server.

Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (it is defined in config.json). If you run on GPU, make sure the appropriate drivers are installed so the device is accessible to the model server.

ovms --rest_port 8000 --config_path ./models/config.json

Readiness Check#

Wait for the model to load. You can check the status with a simple command below. Note that the slash / in the model name needs to be escaped with %2F:

curl -i http://localhost:8000/v2/models/BAAI%2Fbge-large-en-v1.5/ready
HTTP/1.1 200 OK
content-length: 0
content-type: application/json; charset=utf-8
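
If you are scripting the deployment, the same check can be polled until the server reports readiness. A minimal sketch, assuming the requests package is installed:

import time
import urllib.parse

import requests

model = "BAAI/bge-large-en-v1.5"
# quote() with safe="" escapes the slash in the model name to %2F, as in the curl call above
url = f"http://localhost:8000/v2/models/{urllib.parse.quote(model, safe='')}/ready"

for _ in range(30):
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("model is ready")
            break
    except requests.ConnectionError:
        pass  # server is still starting up
    time.sleep(2)
else:
    raise RuntimeError("model did not become ready in time")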

Client code#

Request embeddings with cURL
curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" -d "{ \"model\": \"BAAI/bge-large-en-v1.5\", \"input\": \"hello world\"}"
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        -0.03440694510936737,
        -0.02553200162947178,
        -0.010130723007023335,
        -0.013917984440922737,
...
        0.02722850814461708,
        -0.017527244985103607,
        -0.0053995149210095406
      ],
      "index": 0
    }
  ],
  "usage":{"prompt_tokens":4,"total_tokens":4}
}

Request embeddings with OpenAI Python package
pip3 install openai
echo '
from openai import OpenAI
import numpy as np

client = OpenAI(
  base_url="http://localhost:8000/v3",
  api_key="unused"
)
model = "BAAI/bge-large-en-v1.5"
embedding_responses = client.embeddings.create(
    input=[
        "That is a happy person",
        "That is a happy very person"
    ],
    model=model,
)
embedding_from_string1 = np.array(embedding_responses.data[0].embedding)
embedding_from_string2 = np.array(embedding_responses.data[1].embedding)
cos_sim = np.dot(embedding_from_string1, embedding_from_string2)/(np.linalg.norm(embedding_from_string1)*np.linalg.norm(embedding_from_string2))
print("Similarity score as cos_sim", cos_sim)' >> openai_client.py

python openai_client.py

It will report results like Similarity score as cos_sim 0.9605122725993963.

Benchmarking feature extraction#

An asynchronous benchmarking client can be used to assess the model server performance under various load conditions. Below are execution examples captured on a dual Intel(R) Xeon(R) CPU Max 9480 system.

git clone https://github.com/openvinotoolkit/model_server
pushd .
cd model_server/demos/benchmark/embeddings/
pip install -r requirements.txt
python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --dataset synthetic --synthetic_length 5 --request_rate 10 --batch_size 1 --model BAAI/bge-large-en-v1.5
Number of documents: 1000
100%|████████████████████████████████████████████████████████████████| 1000/1000 [01:44<00:00,  9.56it/s]
Tokens: 5000
Success rate: 100.0%. (1000/1000)
Throughput - Tokens per second: 47.8
Mean latency: 14.40 ms
Median latency: 13.97 ms
Average document length: 5.0 tokens


python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 32 --dataset synthetic --synthetic_length 510 --model BAAI/bge-large-en-v1.5
Number of documents: 1000
100%|████████████████████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.82it/s]
Tokens: 510000
Success rate: 100.0%. (32/32)
Throughput - Tokens per second: 29,066.2
Mean latency: 9768.28 ms
Median latency: 9905.79 ms
Average document length: 510.0 tokens


python benchmark_embeddings.py --api_url http://localhost:8000/v3/embeddings --request_rate inf --batch_size 1 --dataset Cohere/wikipedia-22-12-simple-embeddings --model BAAI/bge-large-en-v1.5
Number of documents: 1000
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:15<00:00, 64.02it/s]
Tokens: 83208
Success rate: 100.0%. (1000/1000)
Throughput - Tokens per second: 4,120.6
Mean latency: 1882.98 ms
Median latency: 1608.47 ms
Average document length: 83.208 tokens

RAG with Model Server#

The embeddings endpoint can be applied in RAG chains to delegate text feature extraction, both for document vectorization and for context retrieval. Check this demo to see a LangChain code example that uses OpenVINO Model Server for both the text generation and embeddings endpoints in a RAG application.
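
As a minimal sketch of that integration (assuming the langchain-openai package is installed), the standard OpenAIEmbeddings client can simply be pointed at the model server started above:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-en-v1.5",
    base_url="http://localhost:8000/v3",
    api_key="unused",
    # served models are not in tiktoken's registry, so skip client-side length checks
    check_embedding_ctx_length=False,
)

vectors = embeddings.embed_documents(["OVMS serves embeddings", "RAG needs a vector store"])
print(len(vectors), len(vectors[0]))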

Testing the model accuracy over serving API#

A simple method of testing the response accuracy is to compare the model server response for a sample prompt with local Python execution based on Hugging Face code.

The script compare_results.py can assist with such an experiment.

popd
cd model_server/demos/embeddings
python compare_results.py --model BAAI/bge-large-en-v1.5 --service_url http://localhost:8000/v3/embeddings --pooling CLS --input "hello world" --input "goodbye world"

input ['hello world', 'goodbye world']
HF Duration: 93.921 ms BertModel
OVMS Duration: 160.806 ms
Batch number: 0
OVMS embeddings: shape: (1024,) emb[:20]:
 [ 0.0336  0.0321  0.0213 -0.0373 -0.0156 -0.0122  0.0246  0.0412  0.0492
  0.0207  0.0056  0.0169 -0.0133  0.0009 -0.0421  0.0206 -0.0222 -0.0291
 -0.0532  0.0382]
HF AutoModel: shape: (1024,) emb[:20]:
 [ 0.0343  0.0332  0.0219 -0.0371 -0.0158 -0.0131  0.0247  0.0408  0.0489
  0.0208  0.0053  0.0176 -0.0132  0.001  -0.0422  0.0208 -0.0213 -0.0278
 -0.0538  0.0388]
Difference score with HF AutoModel: 0.020708760995591734
Batch number: 1
OVMS embeddings: shape: (1024,) emb[:20]:
 [ 0.0161  0.0156  0.0235  0.0199  0.0005 -0.0559  0.0124  0.0122  0.0205
 -0.027   0.0152  0.0153 -0.0429 -0.0537 -0.0514 -0.0059 -0.0294 -0.0451
 -0.0371  0.0361]
HF AutoModel: shape: (1024,) emb[:20]:
 [ 0.0175  0.0161  0.0234  0.0196  0.0012 -0.0565  0.0109  0.0111  0.0194
 -0.0275  0.0148  0.0144 -0.0425 -0.0538 -0.0515 -0.0062 -0.0298 -0.0447
 -0.0376  0.0359]
Difference score with HF AutoModel: 0.020293646680283224

It is also easy to run model evaluation with the MTEB framework, using a custom model class based on the OpenAI client:

pip install "mteb<2" einops openai --extra-index-url "https://download.pytorch.org/whl/cpu"
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/embeddings/ovms_mteb.py -o ovms_mteb.py
python ovms_mteb.py --model BAAI/bge-large-en-v1.5 --service_url http://localhost:8000/v3/embeddings

Results will be stored in the results folder:

{
  "dataset_revision": "0fd18e25b25c072e09e0d92ab615fda904d66300",
  "task_name": "Banking77Classification",
  "mteb_version": "1.39.7",
  "scores": {
    "test": [
      {
        "accuracy": 0.848636,
        "f1": 0.842405,
        "f1_weighted": 0.842405,
        "scores_per_experiment": [
          {
            "accuracy": 0.842532,
            "f1": 0.835091,
            "f1_weighted": 0.835091
          },
          {
            "accuracy": 0.851299,
            "f1": 0.844622,
            "f1_weighted": 0.844622
          },
          {
            "accuracy": 0.849026,
            "f1": 0.842238,
            "f1_weighted": 0.842238
          },
          {
            "accuracy": 0.853571,
            "f1": 0.849815,
            "f1_weighted": 0.849815
          },
          {
            "accuracy": 0.846104,
            "f1": 0.839,
            "f1_weighted": 0.839
          },
          {
            "accuracy": 0.849675,
            "f1": 0.844259,
            "f1_weighted": 0.844259
          },
          {
            "accuracy": 0.846104,
            "f1": 0.840343,
            "f1_weighted": 0.840343
          },
          {
            "accuracy": 0.846753,
            "f1": 0.8397,
            "f1_weighted": 0.8397
          },
          {
            "accuracy": 0.853571,
            "f1": 0.848239,
            "f1_weighted": 0.848239
          },
          {
            "accuracy": 0.847727,
            "f1": 0.84074,
            "f1_weighted": 0.84074
          }
        ],
        "main_score": 0.848636,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 3841.1886789798737,
  "kg_co2_emissions": null
}

Compare against local HuggingFace execution for reference:

mteb run -m thenlper/gte-small -t Banking77Classification --output_folder results

Usage of tokenize endpoint (release 2025.4 or weekly)#

The tokenize endpoint provides a simple API for tokenizing input text using the same tokenizer as the deployed embeddings model. This allows you to see how your text will be split into tokens before feature extraction or inference. The endpoint accepts a string or list of strings and returns the corresponding token IDs and tokenized text.

Example usage:

curl http://localhost:8000/v3/tokenize -H "Content-Type: application/json" -d "{ \"model\": \"BAAI/bge-large-en-v1.5\", \"text\": \"hello world\" }"

Response:

{
  "tokens": [101,7592,2088,102]
}
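
The same call can be made from Python. A minimal sketch using the requests package, mirroring the curl example above:

import requests

response = requests.post(
    "http://localhost:8000/v3/tokenize",
    json={"model": "BAAI/bge-large-en-v1.5", "text": "hello world"},
)
print(response.json()["tokens"])  # e.g. [101, 7592, 2088, 102]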

It’s possible to use additional parameters:

  • pad_to_max_length - whether to pad the sequence to the maximum length. Default is False.

  • max_length - maximum length of the sequence. If None (default), the value is taken from the IR (where the default value from the original HF/GGUF model is stored).

  • padding_side - side to pad the sequence, can be ‘left’ or ‘right’. Default is None.

  • add_special_tokens - whether to add special tokens like BOS, EOS, PAD. Default is True.

Example usage:

curl http://localhost:8000/v3/tokenize -H "Content-Type: application/json" -d "{ \"model\": \"BAAI/bge-large-en-v1.5\", \"text\": \"hello world\", \"max_length\": 10, \"pad_to_max_length\": true, \"padding_side\": \"left\", \"add_special_tokens\": true }"

Response:

{
  "tokens":[0,0,0,0,0,0,101,7592,2088,102]
}