Agentic AI with OpenVINO Model Server#

OpenVINO Model Server can be used to serve language models for AI Agents. It supports the usage of tools in the context of content generation. It can be integrated with MCP servers and AI agent frameworks. You can learn more about tools calling based on OpenAI API

Here are presented required steps to deploy language models trained for tools support. The diagram depicting the demo setup is below:

The application employing OpenAI agent SDK is using MCP server. It is equipped with a set of tools to providing context for the content generation. The tools can also be used for automation purposes based on input in text format.

Start MCP server with SSE interface#

Linux#

git clone https://github.com/isdaniel/mcp_weather_server
cd mcp_weather_server && git checkout v0.5.0
docker build -t mcp-weather-server:sse .
docker run -d -p 8080:8080 -e PORT=8080 mcp-weather-server:sse uv run python -m mcp_weather_server --mode sse

Note: On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application

Start the agent#

Install the application requirements

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/openai_agent.py -o openai_agent.py
pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/requirements.txt

Make sure nodejs and npx are installed. On ubuntu it would require sudo apt install nodejs npm. On windows, visit https://nodejs.org/en/download. It is needed for the file system MCP server.

Run the agentic application:

Qwen3-8B

python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking

python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-8B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all

Qwen3-4B

python openai_agent.py --query "What is the current weather in Tokyo?" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream

python openai_agent.py --query "List the files in folder /root" --model Qwen/Qwen3-4B --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all

Llama-3.1-8B-Instruct

python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.1-8B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all

Mistral-7B-Instruct-v0.3

python openai_agent.py --query "List the files in folder /root" --model mistralai/Mistral-7B-Instruct-v0.3 --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all --tool_choice required

Llama-3.2-3B-Instruct

python openai_agent.py --query "List the files in folder /root" --model meta-llama/Llama-3.2-3B-Instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server all

Phi-4-mini-instruct

python openai_agent.py --query "What is the current weather in Tokyo?" --model microsoft/Phi-4-mini-instruct --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather

Qwen3-8B-int4-ov

python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather

OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov

python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Mistral-7B-Instruct-v0.3-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --tool-choice required

Phi-4-mini-instruct-int4-ov

python openai_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Phi-4-mini-instruct-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather

Note: The tool checking the weather forecast in the demo is making a remote call to a REST API server. Make sure you have internet connection and proxy configured while running the agent.

Note: For more interactive mode you can run the application with streaming enabled by providing --stream parameter to the script. Currently streaming is enabled models using hermes3 tool parser.

You can try also similar implementation based on llama_index library working the same way:

pip install llama-index-llms-openai-like==0.5.3 llama-index-core==0.14.5 llama-index-tools-mcp==0.4.2
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/continuous_batching/agentic_ai/llama_index_agent.py -o llama_index_agent.py
python llama_index_agent.py --query "What is the current weather in Tokyo?" --model OpenVINO/Qwen3-8B-int4-ov --base-url http://localhost:8000/v3 --mcp-server-url http://localhost:8080/sse --mcp-server weather --stream --enable-thinking

Testing efficiency in agentic use case#

Using LLM models with AI agents has a unique load characteristics with multi-turn communication and resending bit parts of the prompt as the previous conversation. To simulate such type of load, we should use a dedicated tool multi_turn benchmark.

git clone -b v0.10.2 https://github.com/vllm-project/vllm
cd vllm/benchmarks/multi_turn
pip install -r requirements.txt
sed -i -e 's/if not os.path.exists(args.model)/if 1 == 0/g' benchmark_serving_multi_turn.py
# Testing single client scenario, for example with GPU execution
docker run -d --name ovms --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest-gpu \
--rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 2 --task text_generation --target_device GPU

python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 1 -n 50

# Testing high concurrency, for example on Xeon CPU with constrained resources
docker run -d --name ovms --cpuset-cpus 0-15 --user $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-ov --enable_prefix_caching true --cache_size 20 --task text_generation

python benchmark_serving_multi_turn.py -m Qwen/Qwen3-8B --url http://localhost:8000/v3 -i generate_multi_turn.json --served-model-name OpenVINO/Qwen3-8B-int4-ov --num-clients 24 

Below is an example of the output captured on iGPU:

Parameters:
model=OpenVINO/Qwen3-8B-int4-ov
num_clients=1
num_conversations=100
active_conversations=None
seed=0
Conversations Generation Parameters:
text_files=pg1184.txt
input_num_turns=UniformDistribution[12, 18]
input_common_prefix_num_tokens=Constant[500]
input_prefix_num_tokens=LognormalDistribution[6, 4]
input_num_tokens=UniformDistribution[120, 160]
output_num_tokens=UniformDistribution[80, 120]
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 307.569
requests_per_sec = 0.163
----------------------------------------------------------------------------------------------------
                   count     mean      std      min      25%      50%      75%      90%      max
ttft_ms             50.0  1052.97   987.30   200.61   595.29   852.08  1038.50  1193.38  4265.27
tpot_ms             50.0    51.37     2.37    47.03    49.67    51.45    53.16    54.42    55.23
latency_ms          50.0  6128.26  1093.40  4603.86  5330.43  5995.30  6485.20  7333.73  9505.51
input_num_turns     50.0     7.64     4.72     1.00     3.00     7.00    11.00    15.00    17.00
input_num_tokens    50.0  2298.92   973.02   520.00  1556.50  2367.00  3100.75  3477.70  3867.00

Testing accuracy#

Testing model accuracy is critical for a successful adoption in AI application. The recommended methodology is to use BFCL tool like describe in the testing guide. Here is example of the response from the OpenVINO/Qwen3-8B-int4-ov model:

--test-category simple
{"accuracy": 0.9525, "correct_count": 381, "total_count": 400}

--test-category multiple
{"accuracy": 0.89, "correct_count": 178, "total_count": 200}

--test-category parallel
{"accuracy": 0.89, "correct_count": 178, "total_count": 200}

--test-category irrelevance
{"accuracy": 0.825, "correct_count": 198, "total_count": 240}

Models can be also compared using the leaderboard reports.

Agentic AI with OpenVINO Model Server#

Export LLM model#

Export using python script#

Direct pulling of pre-configured HuggingFace models from docker containers#

Direct pulling of pre-configured HuggingFace models on Windows#

Start OVMS#

Deploying on Windows with GPU#

Deploying in a docker container on CPU#

Deploying in a docker container on GPU#

Deploy all models in a single container#

Start MCP server with SSE interface#

Linux#

Start the agent#

Testing efficiency in agentic use case#

Testing accuracy#