OVMS Pull mode#

This document describes how to use the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration for Generative AI models. When pulling a model from the OpenVINO organization on HF, or when pulling a GGUF model, no additional steps are required. However, when pulling models from outside of the OpenVINO organization, you have to install additional python dependencies for baremetal execution so that optimum-cli is available to the ovms executable, or build the OVMS python container for docker deployments. In summary, you have 3 options:

  • pulling preconfigured models in IR format from OpenVINO organization

  • pulling GGUF models from Hugging Face

  • pulling models with automatic conversion and quantization (requires optimum-cli; a setup sketch follows this list). This option comes with additional considerations such as longer deployment time, pulling the original model data from HF, extra memory for the conversion, and disk space - described here
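For the third option on a baremetal host, optimum-cli has to be available to the ovms executable. A minimal setup sketch, assuming a Python 3 environment and the optimum[openvino] package extra (the exact dependency set for your OVMS version may differ - check the deployment instructions):

# Assumed setup: install Hugging Face Optimum with the OpenVINO backend
# so that optimum-cli can convert and quantize models for OVMS.
python3 -m venv ovms-env
source ovms-env/bin/activate
pip install --upgrade pip
pip install "optimum[openvino]"
# Sanity check - optimum-cli should now be on PATH:
optimum-cli --help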

Pulling the models#

There is a special mode to make OVMS pull the model from Hugging Face before starting the service:

Required: Docker Engine installed

docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:weekly --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> [--gguf_filename SPECIFIC_QUANTIZATION_FILENAME.gguf] --task <task> [TASK_SPECIFIC_PARAMETERS]

Required: OpenVINO Model Server package - see deployment instructions for details.

ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> [--gguf_filename SPECIFIC_QUANTIZATION_FILENAME.gguf] --task <task> [TASK_SPECIFIC_PARAMETERS]

Note: GGUF format models are only supported with --task text_generation. For a list of supported models, check the blog.
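If you are not sure which quantization files a GGUF repository provides, one unofficial way to list them is the public Hugging Face models API; the endpoint and the grep filter below are an illustrative assumption, not part of OVMS:

# List the .gguf files published in a Hugging Face repository (illustrative):
curl -s "https://huggingface.co/api/models/unsloth/Llama-3.2-1B-Instruct-GGUF" | grep -o '"rfilename":"[^"]*\.gguf"'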

Example for pulling OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov:

Required: Docker Engine installed

docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation

Required: OpenVINO Model Server package - see deployment instructions for details.

ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation
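After a successful pull, the model repository contains the model files along with the serving configuration generated by OVMS. An illustrative layout for an IR model (the exact file set varies per model and OVMS version):

models/
└── Phi-3-mini-FastDraft-50M-int8-ov/
    ├── graph.pbtxt                    # serving graph generated by OVMS
    ├── openvino_model.xml             # model topology (OpenVINO IR)
    ├── openvino_model.bin             # model weights
    ├── openvino_tokenizer.xml/.bin    # tokenizer
    └── openvino_detokenizer.xml/.bin  # detokenizer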

Example for pulling the GGUF model unsloth/Llama-3.2-1B-Instruct-GGUF with Q4_K_M quantization:

Required: Docker Engine installed

docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:weekly --pull --source_model "unsloth/Llama-3.2-1B-Instruct-GGUF" --model_repository_path /models --model_name unsloth/Llama-3.2-1B-Instruct-GGUF --task text_generation --gguf_filename Llama-3.2-1B-Instruct-Q4_K_M.gguf

Required: OpenVINO Model Server package - see deployment instructions for details.

ovms --pull --source_model "unsloth/Llama-3.2-1B-Instruct-GGUF" --model_repository_path /models --model_name unsloth/Llama-3.2-1B-Instruct-GGUF --task text_generation --gguf_filename Llama-3.2-1B-Instruct-Q4_K_M.gguf

This prepares all configuration files needed to serve LLMs with OVMS in the model repository. Check the parameters page for detailed descriptions of configuration options and parameter usage.

In case you want to set up the model and start the server in one step, follow the instructions on this page.
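As a rough sketch of that one-step flow (the command shape is inferred from the pull examples above; see the linked page for the authoritative form), you pass a serving port instead of --pull, and OVMS downloads the model first if it is not already present in the repository:

# Assumed one-step example: pull (if needed) and serve in a single command.
ovms --rest_port 8000 --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation

Once the server is running, a text generation model can be queried through the OpenAI-compatible endpoint, for example:

# Illustrative request against the chat/completions API:
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{"model": "Phi-3-mini-FastDraft-50M-int8-ov", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 30}'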

Note: When using pull mode, you need both read and write access rights to the model repository.