# Loading GGUF models in OVMS
This demo shows how to deploy a GGUF model with the OpenVINO Model Server.
Currently supported models are DeepSeek-R1-Distill-Qwen (1.5B, 7B), Qwen2.5 Instruct (1.5B, 3B, 7B), Llama-3.2 Instruct (1B, 3B), and Llama-3.1-8B. Check the list of supported models for more details.
If the model already exists locally, the download is skipped and serving starts immediately.
NOTE: Optionally, to only download the model and skip serving, use the `--pull` parameter and remove `--rest_port`.
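A minimal sketch of such a download-only run, reusing the image and model parameters from the deployment below:

```bash
mkdir models
# Download the GGUF model into ./models without starting the server
docker run --rm --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \
  openvino/model_server:weekly \
  --pull \
  --model_repository_path /models/ \
  --task text_generation \
  --source_model "Qwen/Qwen2.5-3B-Instruct-GGUF" \
  --gguf_filename qwen2.5-3b-instruct-q4_k_m.gguf \
  --model_name Qwen/Qwen2.5-3B-Instruct
```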
Deploy the model:
Start the Docker container:
```bash
mkdir models
docker run -d --rm --user $(id -u):$(id -g) -p 8000:8000 -v $(pwd)/models:/models/:rw \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \
  openvino/model_server:weekly \
  --rest_port 8000 \
  --model_repository_path /models/ \
  --task text_generation \
  --source_model "Qwen/Qwen2.5-3B-Instruct-GGUF" \
  --gguf_filename qwen2.5-3b-instruct-q4_k_m.gguf \
  --model_name Qwen/Qwen2.5-3B-Instruct
```
Or, on bare metal (Windows), start the server binary directly:

```bat
mkdir models
ovms --rest_port 8000 ^
  --model_repository_path /models/ ^
  --task text_generation ^
  --source_model "Qwen/Qwen2.5-3B-Instruct-GGUF" ^
  --gguf_filename qwen2.5-3b-instruct-q4_k_m.gguf ^
  --model_name Qwen/Qwen2.5-3B-Instruct
```
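Before sending requests, you can verify that the model has loaded; one way (a sketch, assuming the REST port configured above) is to query the server's config endpoint:

```bash
# Lists the served models and their status; the model should report the AVAILABLE state
curl -s http://localhost:8000/v1/config | jq .
```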
NOTE: If you want to use a model that is split into several `.gguf` files, specify the filename of the first part only, e.g. `--gguf_filename model-name-00001-of-00002.gguf`.
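For illustration, with a hypothetical repository whose weights are split into two shards (the repository and file names below are made up), the parameters could look like this:

```bash
# Only the first shard is named explicitly; the remaining parts
# are resolved from it, as described in the note above
ovms --rest_port 8000 \
  --model_repository_path /models/ \
  --task text_generation \
  --source_model "some-org/Some-Model-70B-Instruct-GGUF" \
  --gguf_filename some-model-70b-instruct-q4_k_m-00001-of-00002.gguf \
  --model_name some-org/Some-Model-70B-Instruct
```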
Then send a request to the model:
```bash
curl http://localhost:8000/v3/chat/completions -s -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct", "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France in one word?"}]}' \
  | jq .
```
An example response:
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Paris",
        "role": "assistant",
        "tool_calls": []
      }
    }
  ],
  "created": 1756986130,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 54,
    "completion_tokens": 1,
    "total_tokens": 55
  }
}
```
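The endpoint also supports streamed generation; a sketch of the same request with `"stream": true`, where the completion arrives incrementally as server-sent `data:` chunks:

```bash
# -N disables curl's output buffering so chunks print as they arrive
curl http://localhost:8000/v3/chat/completions -s -N -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct", "stream": true, "messages": [{"role": "user", "content": "What is the capital of France in one word?"}]}'
```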
NOTE: The model download feature is described in depth on a separate documentation page: Pulling HuggingFace Models.