Testing LLM and VLM serving accuracy#

This guide shows how to access to LLM and VLM model over serving endpoint.

The lm-evaluation-harness framework provides a convenient method of evaluating the quality of the model exposed over OpenAI API. It reports end to end quality of served model from the client application point of view.

Note: Below steps have been verified on Linux

Preparing the lm-evaluation-harness framework#

Install the framework via pip:

pip3 install --extra-index-url "https://download.pytorch.org/whl/cpu" lm_eval[api] langdetect immutabledict dotenv openai

Starting the model server#

Example of LLM and VLM models deployment is documented in other demos like Agentic usage for LLM models Using VLM models

Running the tests for LLM models#

lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG  --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100

local-chat-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'base_url': 'http://localhost:8000/v3/chat/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.62|±  |0.0488|
|     |       |strict-match    |     5|exact_match|↑  | 0.17|±  |0.0378|

While testing the non chat model and completion endpoint, the command would look like this:

lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG  --log_samples --output_path results/ --seed 1 --limit 100

local-completions ({'model': 'meta-llama/Meta-Llama-3.1-8B', 'base_url': 'http://localhost:8000/v3/completions', 'num_concurrent': 10, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.43|±  |0.0498|
|     |       |strict-match    |     5|exact_match|↑  | 0.43|±  |0.0498|

Other examples are below:

lm-eval --model local-chat-completions --tasks leaderboard_ifeval --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 --apply_chat_template  
lm-eval --model local-completions --tasks wikitext --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100

Running the tests for VLM models#

Use lmms-eval project - mme and mmmu_val tasks.

export OPENAI_BASE_URL=http://localhost:8000/v3
export OPENAI_API_KEY="unused"
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"
python -m lmms_eval \
    --model openai_compatible \
    --model_args model_version=OpenVINO/InternVL2-8B_int4-ov,max_retries=1 \
    --tasks mme,mmmu_val \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix openai_compatible \
    --output_path ./logs

Results example:

openai_compatible (model_version=OpenGVLab/InternVL2_5-8B,max_retries=1), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 1
| Tasks  |Version|Filter|n-shot|       Metric       |   |  Value  |   |Stderr|
|--------|-------|------|-----:|--------------------|---|--------:|---|------|
|mme     |Yaml   |none  |     0|mme_cognition_score |↑  | 600.3571|±  |   N/A|
|mme     |Yaml   |none  |     0|mme_perception_score|↑  |1618.2984|±  |   N/A|
|mmmu_val|      0|none  |     0|mmmu_acc            |↑  |   0.5322|±  |   N/A|

Running the tests for agentic models with function calls#

Use Berkeley function call leaderboard

git clone https://github.com/ShishirPatil/gorilla
cd gorilla/berkeley-function-call-leaderboard
git checkout 9b8a5202544f49a846aced185a340361231ef3e1
curl -s https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/accuracy/gorilla.patch | git apply -v
pip install -e . --extra-index-url "https://download.pytorch.org/whl/cpu"

The commands below assumes the models is deployed with the name ovms-model. It must match the name set in the bfcl_eval/constants/model_config.py.

export OPENAI_BASE_URL=http://localhost:8000/v3
export CHAT_TEMPLATE_KWARGS='{"enable_thinking":false, "reasoning_effort":"low", "preserve_reasoning":false}'

bfcl generate --model ovms-model --test-category simple_python,multiple,multi_turn_base --temperature 0.0 --num-threads 10 -o --result-dir model_name_dir
bfcl evaluate --model ovms-model --result-dir model_name_dir 

Alternatively, use the model name ovms-model-stream to run the tests with stream requests. The results should be the same.

export OPENAI_BASE_URL=http://localhost:8000/v3
bfcl generate --model ovms-model-stream --test-category simple_python,multiple,multi_turn_base --temperature 0.0 --num-threads 10 -o --result-dir model_name_dir
bfcl evaluate --model ovms-model-stream --result-dir model_name_dir 

Analyzing results The output artifacts will be stored in result and scores. For example:

cat score/openvino-qwen3-8b-int4-FC/BFCL_v4_simple_python_score.json | head -1
{"accuracy": 0.95, "correct_count": 380, "total_count": 400}

Those results can be compared with the reference from the berkeley leaderbaord.


Note: The same procedure can be used to validate vLLM component. The only needed change would be updating base_url including replacing /v3/ with /v1/.